PREFACE


The purpose of this report is to describe some of the factors that
should be taken into account in the construction of a battery of human
information processing tests.1 This report is concerned primarily with
batteries that will be administered in a repeated-measures paradigm, although
some of the sections, such as Implementation Problems, pertain to the
construction of any battery. This report is intended for individuals who have
a limited knowledge of human information processing and the pitfalls
associated with computerized testing. It is designed in part to supplement
information previously published in professional journals and books on the
properties of various information processing tests. For this reason, no
data are presented, and detailed descriptions of each test are omitted since
these are available elsewhere (e.g., the Unified Tri-service Cognitive
Performance Assessment Battery (1) documentation). The strengths and
weaknesses of each test are, however, described in some detail along with
specific implementation problems. The author has assumed that the reader will
be assembling a battery from tests that are currently available. Therefore, no
discussion of the development and verification of new performance tests is
provided.

This report is divided into two major sections. The first section,
Chapters 1-4, discusses general problems associated with the development of
an information processing battery. These include test selection,
methodological issues, and implementation problems. The second major section,
Chapters 5-8, describes the two major types of tests that are included in
information processing batteries: rate-of-information-processing tests and
tests of higher processes. The most commonly used tests in both of these
categories are described. Other types of tests, such as verbal reasoning
or spatial visualization, were not included in this report because
relatively few computerized versions of them have been constructed.

A glossary of terms is given at the end of this report. This glossary
is intended primarily for readers with a limited knowledge of the
terminology used in a human information processing context. The definitions
in the glossary are specific to this monograph and should not be construed
as general or exhaustive definitions of the terms.


1[...] performance battery or to an instrument used to obtain data on an
individual. The term "task" refers to specific instruments or classes of
instruments, such as the Sternberg task or rate-of-information-processing
tasks.
1. ISSUES IN BATTERY CONSTRUCTION

A battery is nothing more than a set of tests that, as a whole, measure
certain skills, abilities, and processes. The investigator's first job,
therefore, is to select a subset of tests from those available that will
measure the skills, abilities, or processes of interest. In some cases, it
will be necessary to select tests that measure a broad spectrum of human
information processing skills, abilities, and processes. Such
'broad-spectrum' batteries usually are needed in two situations: (1) when
little is known about the effect of the experimental factor on human
information processing, or (2) when the investigator is interested in
performance on some real-world activity that requires a broad range of
information processing skills and abilities. Flying is an excellent example
of the latter situation because almost every known skill, ability, and
process is required by some aspect of flight.

'Narrow-spectrum' batteries are required when a given experimental
factor is known to affect only certain skills, abilities, and processes. For
example, alcohol has been found to disrupt response processes but to have no
significant effect on memory retrieval processes (2). A scientist concerned
with predicting the effects of alcohol on an activity requiring fine motor
coordination would construct a battery with several tests of fine motor
coordination and few, if any, tests of memory retrieval. Narrow-spectrum
batteries are also used when the activity of interest requires relatively
few skills and abilities. For example, most of a sonar operator's activities
require visual perception and pattern comparison skills but relatively few
fine motor skills. Thus, a battery for testing the effect of some
experimental factor on a sonar operator's performance would include few, if
any, tests of fine motor skills.


Currently,  an  investigator  has  at  least  three  ways  to  select  tests  for 
either  a  narrow-  or  broad-spectrum  battery  that  is  concerned  with  a  specific 
real-world  activity.  First,  the  investigator  can  abstract  the  skills  and 
abilities  needed  to  perform  the  activity  successfully  from  the  appropriate 
task  analyses.  The  major  problem  with  this  approach  is  that  most  task 
analyses  only  describe  observable  events.  As  a  result,  determining  which 
information  processing  skills  and  abilities  could  have  resulted  in  the 
observed  behavior  is  often  almost  impossible. 

Second, the investigator can select tests that correlate with known
predictors of the activity or use the predictors themselves in the test
battery. For example, suppose scores on a general intelligence test
correlate with scores on a paper-and-pencil test of spatial reasoning and that
the spatial reasoning scores predict performance in flight training. To
construct a test battery to predict success in flight training, the
investigator could include either the general intelligence test or the spatial
reasoning test. This approach has several problems. The major one is that
performance on very few real-world activities is accurately predicted by
either paper-and-pencil tests or computerized tests. Thus, this approach
could be used only for a very few activities.

Third, the investigator can rely on personal knowledge of the activity
to identify the required skills and abilities and select tests that measure
these skills and abilities. Although many pitfalls are associated with this
approach, it is often the one used in battery development because of the
unavailability of adequate task analyses and the lack of reliable predictors
of performance.

If  an  investigator  is  interested  in  examining  the  effect  of  a  given 
experimental  factor  on  human  information  processing  in  general,  selecting 
tests  for  a  battery  is  much  easier.  In  this  situation,  the  investigator 
usually  will  construct  a  broad-spectrum  battery  and  choose  one  or  two  tests 
from  each  major  category  of  interest.  So,  for  example,  the  investigator  may 
include  in  the  battery  one  verbal  short-term  memory  test,  one  verbal  reason¬ 
ing  test,  one  spatial  short-term  memory  test,  one  spatial  reasoning  test,  et 
cetera.  The  only  common  restriction  in  constructing  such  a  battery  is  the 
total  time  available  to  test  each  subject. 
2. TEST SELECTION

Once  the  investigator  has  decided  on  either  a  broad-spectrum  or  a 
narrow-spectrum  battery,  there  are  many  questions  to  be  considered  in  selec¬ 
ting  specific  tests  for  the  battery.2  This  chapter  presents  and  discusses 
11  questions  in  approximately  the  order  in  which  they  should  be  considered. 
Some  of  the  questions  pertain  primarily  to  a  repeated-measures  paradigm  and 
may not be of concern to an investigator constructing, for instance, a
neuropsychological  battery.  These  11  questions  are  suggestive  of  those  the 
investigator  should  bear  in  mind  when  selecting  tests  for  a  battery;  the 
list  is  not  exhaustive.  Additionally,  these  questions  pertain  only  to  tests 
of  information  processing  skills,  abilities,  and  processes.  They  are  not 
directly  applicable  to  tests  of  personality  traits  or  mood  scales. 

1.  Does  the  test  measure  a  specified  skill,  ability,  or  process? 

What  the  test  measures  should  be  clearly  identified  and  should  be  specific. 
Data  should  be  available  either  from  the  developer  or  the  scientific  liter¬ 
ature  demonstrating  that  the  test  does  indeed  measure  its  purported  skill, 
ability,  or  process  (see  Question  3).  Tests  of  'general  information  pro¬ 
cessing'  or  'memory'  are  generally  worthless. 

2.  Does  a  test  measure  the  same  skills,  abilities,  or  processes  as 
other  tests  already  included  in  the  battery?  Multiple  tests  of  a  certain 
skill,  ability,  or  process  should  be  routinely  included  in  narrow-spectrum 
batteries  but  excluded  from  broad-spectrum  batteries  unless  the  skill, 
ability,  or  process  is  extremely  important  for  the  successful  completion  of 
the activity.3 For example, the aircrew selection battery under development
at the Naval Aerospace Medical Research Laboratory (NAMRL) includes
several tests of spatial processes. These tests are included because
spatial processing plays a critical role in aircrew performance.
The major problem in selecting a test of a skill, ability, or process
concerns the poor intercorrelation between tests purportedly measuring the
same thing. The main point to bear in mind is that no 'pure' tests of most
skills or abilities exist. For example, no 'pure' test of spatial ability
has been developed because other skills and abilities affect performance on
the 'spatial' test. Typically, if two tests that purportedly measure the
same skill or ability correlate poorly, a detailed analysis of the tests
will indicate that each requires a number of skills and abilities not
required by the other test. In contrast, tests purportedly measuring the same
processes typically correlate highly because processes are initially
identified by a very rigorous and time-consuming series of experiments.



2The author assumes that all subjects will be screened for any basic
physiological abilities required by the tests of a battery; for example, all
subjects will be screened for color blindness if one of the tests of the
battery requires color discrimination.


3Multiple tests of the same skill or ability are routinely included in
narrow-spectrum  batteries  because,  as  discussed  in  the  next  paragraph,  no 
pure  tests  of  skills  or  abilities  exist.  Therefore,  multiple  tests  of  a 
given  skill  or  ability  are  included  to  increase  the  probability  of  obtaining 
an  accurate  assessment  of  the  desired  skill  or  ability. 
An investigator's objective in designing a battery is to select a test
that correlates most highly with real-world performance or that is most
representative of a general category of skills, abilities, or processes. As
noted earlier, few correlations between information processing tests and
performance on a real-world activity are available. Therefore, selecting
tests that require the same skills, abilities, or processes as the activity
under study must be based on other considerations, such as the amount of
practice required to reach differential stability4 (see also the appendix)
or the amount of baseline data available. Similarly, selecting a test to
represent a general category of skills, abilities, or processes should be
based on a variety of considerations, including those discussed in Questions
1, 3, 5, 7, and 8.

3. How was the test validated? To claim to have a test of a
never-before-measured skill, ability, or process, the developer must complete
an elaborate series of experiments. These experiments must demonstrate that
the new test is affected by a variety of experimental factors in the way
that would be predicted if it indeed measured the purported skill, ability,
or process. This procedure is extremely time-consuming and costly and
appears rarely to be attempted outside the university environment. The best
example of this procedure probably is Sternberg's development of a test of
memory scanning (3,4).

To  demonstrate  that  a  new  test  measures  some  skill,  ability,  or  process 
that  is  measured  by  other  validated  tests,  the  test  developer  still  must 
complete  a  fairly  time-consuming  procedure.5  The  developer  must  demonstrate 
that the new test correlates with the other validated paper-and-pencil or
computerized  tests  of  the  same  skill,  ability,  or  process.  Occasionally, 
the  developer  also  must  demonstrate  that  the  new  test  predicts  performance 
in  a  real-world  situation  where  the  skill,  ability,  or  process  is  known  to 
affect  performance. 

4.  Are  baseline  data  available?  The  test  developer  should  at  least 
furnish  means,  standard  deviations,  and  ranges  for  asymptotic  performance 


4A differentially stable task has three characteristics: (1) the mean
of the group's performance is constant or increasing in a slow, linear
fashion;  (2)  the  standard  deviation  of  the  group's  performance  is  constant; 
and  (3)  the  rank  order  of  subjects  is  constant.  Statistical  tests  are  used 
to  determine  differential  stability;  thus,  some  level  of  error  (p  =  .05,  .01, 
etc.)  is  involved  in  asserting  that  the  mean,  standard  deviation,  or  the 
rank  order  of  subjects  is  "constant."  All  tests  of  differential  stability 
are  performed  on  the  intertrial  correlations,  not  on  the  raw  data. 

5This paragraph raises the question of why anyone would develop a test
of some skill, ability, or process when a validated test already existed.
The primary answer is convenience; many information processing tests require
extensive practice before differential stability is reached (see Question 5)
or require large amounts of data for analysis. Thus, all the validated
tests of some skill, ability, or process may be impractical to use in an
applied situation.
for a clearly defined population.6 Learning data (time to asymptote, time
to  differential  stability,  or  learning  curve  parameters)  should  also  be 
given.  These  statistics  are  often  not  available  for  the  population  of 
interest.  The  investigator  then  must  decide  either  to  extrapolate  from  the 
existing  data  to  the  population  of  interest  or  to  collect  baseline  data  on 
individuals  from  the  population  of  interest. 

Baseline  data  can  help  the  investigator  implement  the  test.  If  summary 
statistics  obtained  in  the  investigator's  laboratory  do  not  correspond  to 
those  obtained  from  the  test  developer  for  a  comparable  population,  the  test 
may  have  been  incorrectly  implemented.  That  is,  the  differences  may  be 
caused  either  by  programming  errors  or  hardware  problems.  Another  possibil¬ 
ity  is  that  the  'comparable'  populations  are  different.  In  any  case,  the 
existence  of  baseline  data  is  a  valuable  aid  in  the  development  of  the 
battery.

A complete lack of baseline data is an extremely serious indication that
the test has not been adequately developed. If baseline data are not
available, the test is either in an extremely early stage of development or
has not been subjected to rigorous examination, and the nature of what is
being tested should be questioned.

5. How much practice is required for the test to obtain differential
stability? Stability is not a now-or-never state of affairs; it is
determined by practice. Therefore, if a test has not reached differential
stability at some point, a small additional amount of practice may make it
stable. The major issue then concerns the amount of practice necessary to
reach differential stability. The investigator must determine how much time
is available for practice before the experiment begins and select tests that
obtain stability during this period.
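
The three characteristics of differential stability (group mean, group
standard deviation, and rank order of subjects) can be inspected
descriptively before any formal statistical test is run. The following is
only a minimal sketch in Python under stated assumptions: the scores, the
number of sessions, and the function names are hypothetical, and the sketch
prints summary values rather than performing the statistical tests on which a
determination of differential stability actually rests.

```python
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def session_summary(scores):
    """scores[s][i] is the score of subject i in practice session s.

    Returns one tuple per session: group mean, group standard deviation,
    and the correlation with the next session (None for the last session).
    """
    rows = []
    for s, session in enumerate(scores):
        nxt = scores[s + 1] if s + 1 < len(scores) else None
        r = pearson(session, nxt) if nxt is not None else None
        rows.append((mean(session), stdev(session), r))
    return rows


# Hypothetical reaction times (ms): 4 practice sessions x 5 subjects.
scores = [
    [620, 580, 700, 540, 660],
    [560, 545, 610, 500, 600],
    [530, 520, 585, 480, 570],
    [525, 515, 580, 476, 566],
]

for m, sd, r in session_summary(scores):
    print(round(m, 1), round(sd, 1), None if r is None else round(r, 3))
```

Means and standard deviations that level off across the later sessions,
together with consistently high session-to-session correlations, would suggest
that formal tests of stability are worth running on those sessions.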

Currently, the only publication available describing the time to
stability for a variety of tests is Bittner et al. (6). This report summarizes
several years of work on the Performance Evaluation Tests for Environmental
Research (PETER) project. The reader should remember that all of the tests
were evaluated using one testing schedule and one type of subject--volunteer,
enlisted personnel. In some cases, the same subjects performed many
of the same tests. If the schedule affects the time to stability, the
values given in Bittner et al. may differ from those obtained using either
more massed or more distributed testing schedules. Similarly, if the


6The  reader  should  note  that  asymptotic  performance  and  differential 
stability  are  not  identical.  Differential  stability  is  described  earlier  in 
the monograph and in the appendix and is mathematically determined.
Asymptotic performance has two meanings. The first, which is less common, occurs
when  a  learning  curve  has  been  fit  and  an  asymptote  has  been  identified 
mathematically.  This  use  of  the  term  "asymptotic  performance"  means  "the 
terminal  level  of  performance  after  an  infinite  amount  of  practice."  The 
more  common  use  of  the  term  asymptotic  performance  (also  called  stable 
performance)  implies  that  the  mean  performance  on  several  consecutive  trials 
did  not  change  or  changed  very  little.  Most  uses  of  this  term  imply  a 
judgment  by  the  investigator  that  performance  would  not  improve  further  with 
practice.  The  riskiness  of  this  assumption  is  demonstrated  by  Bradley  (5). 
subject population affects the time to stability, the figures in Bittner et
al.  may  not  be  accurate,  and  the  data  should  be  regarded  as  positively 
biased  by  the  experimental  sophistication  of  the  subjects.  Currently,  few 
data  exist  on  the  relation  between  time  to  stability  and  the  testing  sched¬ 
ule,  and  those  data  conflict.  No  data  examine  the  time  to  stability  as  a 
function  of  the  characteristics  of  the  subjects  taking  the  test.  There¬ 
fore,  the  values  given  in  Bittner  et  al.  should  be  regarded  as  estimates  of 
the  time  to  reach  differential  stability,  not  as  absolute  figures. 

Several  other  facts  about  stability  should  be  discussed.  One  of  these 
concerns  tests,  such  as  the  Sternberg  memory  search  task,  that  use  the  slope 
between  the  average  correct  reaction  time  and  a  task  variable  as  a  dependent 
measure.  These  tests  are  among  the  most  carefully  developed  and  theoreti¬ 
cally  important  ones  in  cognitive  psychology  (see  Chapter  6).  Generally, 
Bittner et al. (6) found that the slopes of these tests did not stabilize
during  the  testing  period  or  stabilized  very  late  in  testing.  Thus,  these 
tests  may  require  a  great  deal  of  practice  before  they  can  be  used  effec¬ 
tively. 
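
For readers unfamiliar with slope measures, the dependent measure described
above is the slope of a straight line fitted to average correct reaction time
as a function of the task variable (memory set size in the Sternberg task).
The sketch below, in Python, shows one common way to obtain such a slope by
ordinary least squares. The set sizes and reaction times are invented for
illustration, and least squares is assumed here rather than taken from any
particular test's documentation.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y regressed on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical data for one subject in one session: memory set sizes and
# the mean correct reaction time (ms) observed at each set size.
set_sizes = [1, 2, 4, 6]
mean_rt_ms = [430.0, 470.0, 550.0, 630.0]

slope, intercept = fit_line(set_sizes, mean_rt_ms)
print(slope, intercept)  # ms per additional item, and ms at set size zero
```

Because the slope is estimated from only a few condition means, each of which
carries its own error, it is plausible that it needs more practice to
stabilize than the reaction times themselves.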

A  second  fact  concerns  tests  with  multiple  dependent  measures,  such  as 
mental  arithmetic  tasks  that  use  average  correct  reaction  time  and  per¬ 
centage  correct.  Commonly,  one  dependent  measure  of  a  test  obtains  sta¬ 
bility  before  another.  In  selecting  tests  for  a  battery,  a  scientist  should 
determine  which  dependent  measures  are  of  interest  and  ensure,  as  much  as 
possible,  that  those  measures  will  become  stable  in  the  time  period  allotted 
for training.

A third fact concerns the stability of task combinations. Many
batteries designed to address applied issues include one or more task
combinations. Although few data are available, performance under dual-task
conditions appears to stabilize very slowly. At this time, almost no data show
the relation between the stability of performance of each task singly and
the stability of performance of each task under dual-task conditions.
Performance on each of the tasks apparently does not have to be stable for the
combination to be stable; a tracking-mental arithmetic combination
investigated at NAMRL had some stable dependent measures although neither task
was stable when performed alone. Because of the lack of knowledge about
multiple-task stability, investigators should consider carefully the value of
including task combinations in a battery.

Finally, the investigator should consider the issue of post-stability
test definition. Theoretically, differential stability does not depend on
the magnitude of the intertrial correlations; only their consistency
determines stability. Test definition is concerned with the magnitude of the
intertrial correlations. Generally, if the average correlation is less than
0.7, the test is said to have poor definition and is considered to contain
too much unpredictable variance (roughly 50% or more, because a correlation
of 0.7 accounts for only about half of the variance) to provide usable data.
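
The 0.7 criterion can be checked with a few lines of code. The sketch below,
in Python, averages the correlations between all pairs of post-stability
trials and converts the average into a proportion of unpredictable variance
(one minus the squared correlation). The trial scores and function names are
hypothetical, and averaging the raw correlations is an assumption made for
simplicity.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def average_intertrial_r(trials):
    """Mean correlation over every pair of trials (one list per trial)."""
    return mean(pearson(a, b) for a, b in combinations(trials, 2))


def unpredictable_variance(r):
    """Proportion of variance not accounted for by a correlation r."""
    return 1.0 - r ** 2


# Hypothetical post-stability scores: 3 trials x 6 subjects.
trials = [
    [52, 61, 47, 70, 58, 65],
    [55, 59, 50, 68, 54, 66],
    [51, 63, 46, 71, 60, 62],
]

avg_r = average_intertrial_r(trials)
poor_definition = avg_r < 0.7
print(round(avg_r, 3), round(unpredictable_variance(avg_r), 3), poor_definition)
```

At exactly 0.7 the unpredictable proportion is 1 - 0.49 = 0.51, which is the
source of the 50% figure.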

6. Is the test sensitive to the experimental factor under
consideration? Sensitivity implies that at least one experiment has
demonstrated a statistically significant change in test performance from the
experimental factor in question. Tests vary greatly in their sensitivity to
experimental factors. A given test may be sensitive, for instance, to heat but
not to vibration. Using a sensitive test greatly improves the probability
that the investigator will find a statistically significant effect of the
experimental factor.

7. How many data points can be collected per unit time? Some tests,
such as vigilance tests, generate, at best, one datum per 10-min period. In
contrast, reaction time tests may generate 60 or 70 responses per minute.
The investigator must consider the time scale of interest and select tests
that generate a sufficient amount of data for analysis purposes during the
experiment.7

8.  How  much  data  will  be  unusable?  Equipment  failure  and  operator 
error  sometimes  result  in  the  loss  of  large  amounts  of  data,  but  these 
events  cannot  be  predicted.  The  dependent  variables  and  the  experimental 
factors  strongly  influence  the  proportion  of  unusable  data,  and  to  some 
extent,  the  proportion  of  unusable  data  can  be  predicted  for  a  given  experi¬ 
mental  situation.  For  example,  many  experimental  tasks  require  only  a 
yes/no  or  a  true/false  response.  For  the  majority  of  these  tasks,  50% 
correct  represents  chance  performance.  If  an  experimental  factor  makes  the 
task  difficult,  the  percentage  correct  could  fall  to  chance  levels.  The 
effects  of  any  subsequent  experimental  manipulations  would  be  almost  impos¬ 
sible  to  detect,  and  the  data  are  usually  discarded.  Some  computerized 
mental  arithmetic  tasks  seem  particularly  susceptible  to  this  type  of  prob¬ 
lem;  the  subject's  response  is  often  simply  scored  as  correct  or  incorrect. 
Some  subjects  have  so  much  trouble  with  this  type  of  task  that  they  make  few 
correct  answers,  and  their  data  are  normally  discarded. 
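
Whether a subject's percentage correct on a two-choice task can be
distinguished from chance may be checked before the data are analyzed further.
The sketch below, in Python, uses a one-sided exact binomial probability
against a guessing rate of 50%. This particular check and the 0.05 criterion
are assumptions made for illustration; they are not a procedure prescribed in
this report.

```python
from math import comb


def p_at_least(k, n, p=0.5):
    """Probability of k or more correct responses in n trials when every
    response is a guess with success probability p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def above_chance(correct, n_trials, alpha=0.05):
    """True if the observed score would be unlikely under pure guessing."""
    return p_at_least(correct, n_trials) < alpha


# Hypothetical subjects on a 10-trial true/false task.
print(above_chance(9, 10))  # 9 of 10 correct
print(above_chance(6, 10))  # 6 of 10 correct
```

A subject whose score cannot be distinguished from guessing in the baseline
condition leaves no room for an experimental factor to lower performance
further, which is why such data are usually discarded.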

9. How many dependent measures does the test have? Tests with one
dependent measure must be analyzed using univariate analysis techniques,
such as t tests and F tests. These are familiar to most investigators and
are generally easy to execute and interpret. Tests with multiple dependent
measures, such as percentage correct and average correct reaction time, may
be analyzed using either univariate or multivariate techniques, depending on
the characteristics of the obtained data and the inclinations of the
investigator.

One  school  of  thought  maintains  that  multiple  dependent  measures  ob¬ 
tained  from  a  given  task  should  be  analyzed  routinely  using  multivariate 
techniques.  Then,  if  the  measures  are  uncorrelated,  univariate  analyses  can 
be  used.  Because  this  approach  is  relatively  new,  it  is  somewhat  controver¬ 
sial  and  has  several  drawbacks  associated  with  it.  One  drawback  is  that 
multivariate  analyses  are  less  familiar  to  most  investigators  than  univar¬ 
iate  analyses  and  are  more  difficult  to  interpret.  A  second  drawback  is 
that the statistical power associated with a multivariate analysis is
difficult to determine. Therefore, if an investigator fails to find an expected
effect,  it  is  difficult  to  determine  if  the  effect  'really'  was  not  there  or 
if  the  power  associated  with  the  analysis  was  just  too  low  to  detect  the 
effect. 


7The amount of data necessary for analysis purposes depends, among other
things, on the analyses to be conducted and the amount of statistical power
the investigator desires. Readers uncertain about the amount of data to be
collected should consult someone with statistical expertise.
A second school of thought maintains that, for most applied work,
multivariate analyses are simply too difficult to interpret to be of use. This
school attempts to collapse multiple dependent measures from a given test
into one derived score, such as an information-transmitted score. The
problems associated with this approach are discussed below. In any case,
the investigator should be aware that using tests with multiple dependent
measures may require relatively sophisticated analyses that can be both
time-consuming and expensive.

10. Are the dependent measures of a given test raw scores or derived
scores?  Typically,  derived  scores--such  as  Z  scores  or  proportion  scores-- 
are  more  difficult  to  analyze  than  raw  scores  because  their  characteristics 
more  frequently  violate  the  assumptions  of  univariate  analysis.  The  author 
has  performed  simple  statistical  analyses  on  both  the  derived  scores  and  the 
raw  scores  from  several  tests  and  found  that  the  results  of  the  analyses 
performed  on  the  derived  scores  were  considerably  different  from  those 
performed  on  the  raw  scores.  Thus,  tests  that  use  derived  scores  as  de¬ 
pendent  measures  should  be  regarded  with  some  caution. 

11. Are there large individual differences in performance? Few tests
have been studied in sufficient depth to identify consistent individual
differences in performance in a meaningful manner. Some tests show large
individual differences with a relatively normal distribution of scores.
These tests often are excluded from use because the large individual
differences mask the effects of experimental variables. Other tests show large
individual differences with a bimodal distribution of scores. An
investigator should carefully weigh the advantages of including in a battery
any test with a bimodal distribution of one or more dependent variables;
varying the proportion of one type of subject over another can result in
statistically different outcomes.
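
The effect of subject mix on a bimodally distributed measure can be seen with
simple arithmetic: the group mean is a weighted average of the two subgroup
means, so it moves with the proportion of each type of subject even when
neither subgroup changes. The Python sketch below uses invented subgroup
means and a hypothetical function name purely to illustrate the point.

```python
def group_mean(prop_type1, mean_type1, mean_type2):
    """Mean of a group made up of two types of subjects.

    prop_type1 is the proportion of subjects of the first type; the
    remaining subjects are of the second type.
    """
    return prop_type1 * mean_type1 + (1 - prop_type1) * mean_type2


# Hypothetical subgroup means (ms) on a test with a bimodal distribution.
fast, slow = 400.0, 600.0

for proportion_fast in (0.2, 0.5, 0.8):
    print(proportion_fast, group_mean(proportion_fast, fast, slow))
```

Two samples drawn with different proportions of the two subject types could
therefore differ in their means without any experimental factor having been
applied.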
3. METHODOLOGICAL ISSUES


Several methodological (procedural) issues must be given at least some
consideration before the investigator selects the tests of the battery.
Three of the most important--test order, pacing, and knowledge of results--
are discussed in this chapter.

TEST  ORDER 

One  problem  in  the  development  of  any  test  battery  concerns  the  order  in 
which  the  tests  will  be  presented.  In  selecting  the  order,  the  investigator 
must  consider  two  major  issues:  content  and  carry-over  (sequence)  effects. 

Content  issues  are  concerned  with  sequencing  the  tests  to  avoid  subject 
boredom  or  fatigue.  For  example,  some  tracking  tasks  are  physically  fati¬ 
guing.  The  investigator,  therefore,  may  not  want  to  schedule  another  phys¬ 
ically  fatiguing  task  immediately  before  or  immediately  after  a  tracking 
task.  As  another  example,  classical  vigilance  tasks  generally  result  in 
very  low  arousal  levels.  Indeed,  these  tasks  are  often  so  monotonous  that 
subjects  fall  asleep  while  performing  them.  Thus,  investigators  may  want  to 
schedule  vigilance  tasks  at  the  beginning  of  a  battery  when  the  subjects  are 
most  likely  to  be  alert  rather  than  at  the  end  when  subjects  may  be  both 
physically  and  mentally  fatigued. 

At this time, there appear to be no guidelines for taking content
issues into account in test sequencing. The investigator must rely strictly
on general knowledge about the characteristics of each test and combine this
knowledge with the purposes of the experiment to determine a sequence that
introduces the smallest possible number of artifacts into the data.

More  is  known  about  carry-over  effects  than  about  content-related 
dependencies.  The  presence  of  carry-over  effects  is  serious  because  it  is 
not  possible  to  use  certain  experimental  designs  if  carry-over  effects 
exist.  For  example,  the  Latin  square  design  can  be  used  only  when  there  are 
no  carry-over  effects  between  any  of  the  levels  of  the  experimental  factors 
(7).  Unfortunately,  there  is  no  way  to  predict  when  carry-over  effects  will 
occur. Therefore, an investigator may conduct an experiment using a
design that assumes the absence of carry-over effects and find, after all the
data are collected, that these effects have occurred. In such a situation, the data
cannot  be  analyzed  using  the  specified  design,  and  the  investigator  may  find 
no  satisfactory  way  of  analyzing  the  data.  The  most  conservative  approach 
to  carry-over  effects  is  to  assume  that  they  will  occur  and  then  to  conduct 
pretests  to  determine  their  magnitude.  Simon  (8)  provides  a  good  discussion 
of  statistical  methods  of  identifying  and  controlling  for  carry-over 
effects. 

Most  of  the  current  knowledge  about  carry-over  effects  in  applied  re¬ 
search  has  been  obtained  from  examining  the  simplest  possible  situation: 
two tests administered in a counterbalanced fashion. The carry-over effects
obtained from a two-test, counterbalanced experiment are often called
'asymmetric transfer effects.' Basically, asymmetric transfer occurs when the
transfer from Test A to Test B is not the same as the transfer from Test B
to Test A.
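
As an illustration only, the comparison behind asymmetric transfer can be
sketched in a few lines of Python. The function name, the data layout, and
the assumption that higher scores mean better performance are hypothetical;
the sketch simply compares performance on each test when it is given second
with performance on the same test when it is given first.

```python
from statistics import mean


def transfer_effects(a_first, b_first):
    """Estimate transfer in a two-test, counterbalanced design.

    a_first: list of (score_A, score_B) pairs for subjects who took Test A first.
    b_first: list of (score_A, score_B) pairs for subjects who took Test B first.
    Scores are assumed to be scaled so that higher means better (an assumption).
    """
    # Transfer from A to B: Test B taken after A, minus Test B taken first.
    a_to_b = mean(b for _, b in a_first) - mean(b for _, b in b_first)
    # Transfer from B to A: Test A taken after B, minus Test A taken first.
    b_to_a = mean(a for a, _ in b_first) - mean(a for a, _ in a_first)
    # The difference between the two transfer estimates is the asymmetry.
    return a_to_b, b_to_a, a_to_b - b_to_a


# Hypothetical pretest scores for two subjects per order.
group_a_first = [(50, 70), (54, 74)]
group_b_first = [(53, 60), (51, 64)]
print(transfer_effects(group_a_first, group_b_first))
```

The third value is only a descriptive difference; it is not a significance
test and does not replace a proper analysis of the pretest data.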


Poulton documented numerous instances of asymmetric transfer (9,10).
Most  of  these  instances  demonstrate  asymmetric  transfer  between  tracking 
tasks  using  (1)  quickened  versus  unquickened  displays,  (2)  magnified  versus 
unmagnified  displays,  and  (3)  pursuit  versus  compensatory  displays.  More 
recent work by Poulton has demonstrated asymmetric transfer between
different dual-task combinations (11). Damos (12) and Damos and Lyall (13)
have  shown  asymmetric  transfer  between  versions  of  the  same  task  combination 
that  differ  only  in  the  response  mode  (manual  or  speech)  used  to  control  one 
of  the  tasks. 

The  major  problem  with  asymmetric  transfer,  like  other  carry-over 
effects,  is  that  it  can  seriously  bias  the  data.  Damos  (12)  and  Damos  and 
Lyall (13) demonstrated that asymmetric transfer effects can be so large
that they cause spurious statistical effects or completely mask true
effects. Currently, there is no way to correct mathematically for asymmetric
transfer once it occurs. Therefore, all data affected by asymmetric transfer
must be discarded; and, if the experimental design is still usable, the
analyses must be recalculated on the remaining data with the subsequent loss
of statistical power.

At  this  time,  there  is  no  reason  to  assume  that  other  types  of  carry¬ 
over  effects  have  less  serious  consequences  for  data  analyses  than  asym¬ 
metric  transfer  effects.  Since  predicting  when  carry-over  effects  will 
occur  is  impossible,  carry-over  effects  must  be  planned  for  in  these  exper¬ 
imental  designs. 

PACING 

For  each  discrete  test  in  a  battery,  the  investigator  must  consider  if 
the  test  should  be  paced  or  unpaced.  Although  a  great  deal  of  literature 
examines  the  effect  of  machine  pacing  on  industrial  workers,  few  of  these 
studies  provide  information  pertinent  to  the  development  of  a  performance 
battery.  One  reason  for  using  a  paced  rather  than  an  unpaced  version  of  a 
test is to simulate a real-world task more closely. Another reason is that
pacing  may  have  an  alerting  property.  Thus,  the  judicious  use  of  pacing  may 
decrease  the  boredom  caused  by  prolonged  testing.  Finally,  paced  tests  can 
result  in  better  performance  than  unpaced  versions  of  the  same  test  (14). 

There  are  also  a  number  of  reasons  for  not  using  a  paced  version  of  a 
test.  The  first  reason  is  that  the  use  of  a  paced  test  often  adds  an 
additional  dependent  variable;  many  investigators  analyze  the  number  of 
missed  stimuli  in  a  trial  separately  from  either  the  number  of  incorrect  or 
the  number  of  correct  responses.  The  use  of  a  third  variable  complicates 
the knowledge of results given to the subject and subsequent data analyses.
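
To make the extra dependent variable concrete, a paced trial might be scored
along the following lines. This is a sketch under assumptions: the function
name is hypothetical, and each stimulus is assumed to be coded as True
(correct), False (incorrect), or None (no response before the pacing
deadline).

```python
def score_paced_trial(responses):
    """Count correct, incorrect, and missed stimuli in one paced trial.

    responses: one entry per stimulus; True = correct, False = incorrect,
    None = the subject did not respond before the next stimulus (a miss).
    """
    counts = {"correct": 0, "incorrect": 0, "missed": 0}
    for outcome in responses:
        if outcome is None:
            counts["missed"] += 1      # third variable introduced by pacing
        elif outcome:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return counts


print(score_paced_trial([True, False, None, True, None]))
```

In an unpaced version of the same test the "missed" count would always be
zero, which is why the paced version requires a decision about how misses
are reported to the subject and analyzed.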

A  second  reason  is  that  certain  information  processing  stages  may  be 
affected more by the speed stress induced by pacing than others (15). Thus,
the  paced  and  unpaced  versions  of  a  test  may  differ  in  a  number  of  subtle 
and  unidentified  ways.  Currently,  not  enough  data  are  available  to  identify 
the  stages  differentially  affected  by  speed  stress,  and  there  is  no  way  to 
predict  which  stages  may  be  more  affected  than  others. 

The third reason is that many subjects' performance is disrupted by
pacing, particularly under multiple-task conditions. This disruption may be
present even when the pacing interval is objectively too long to affect
performance; that is, some subjects may be so distracted by the knowledge
that the test is paced that their performance is adversely affected. Under
multiple-task conditions, paced combinations usually result in different
response strategies and appear to be much more frustrating and tiring than
the unpaced version of the same combination.

Fourth, paced tests may result in different excretion levels of various
catecholamines  (14)  than  unpaced  versions  of  the  same  test.  This  finding 
cautions  primarily  against  changing  from  an  unpaced  version  of  a  test  to  a 
paced  version  during  the  course  of  the  experiment  although  other  physiologi¬ 
cal  measures--such  as  heart  rate,  respiration  rate,  and  blood  pressure-- 
often  show  no  difference  between  the  paced  and  unpaced  versions  of  a  test 
(see  reference  16  for  an  example). 

KNOWLEDGE  OF  RESULTS 

Another  issue  an  investigator  must  resolve  during  the  design  phase  of  an 
experiment  concerns  knowledge  of  results  (KR).  If  the  subject  is  given  KR, 
the  investigator  must  decide  what  type  of  KR  should  be  presented  and  how 
often  it  should  be  provided.  Fortunately,  the  effect  of  KR  on  motor  and 
simple  cognitive  tasks  has  been  extensively  studied.  (See  reference  17  for 
a  very  basic  review  of  the  terminology  and  general  results  and  18  for  an 
extensive  literature  review.) 

Generally,  KR  has  two  functions.  It  decreases  the  time  to  reach  any 
performance  criteria  established  by  the  investigator,  and  it  maintains  the 
subject's motivation. Because KR is beneficial, it is almost always
provided in human performance laboratory research. Knowledge of results is
routinely omitted only for vigilance tasks and for tasks that provide a
great deal of intrinsic feedback. The reason KR is not provided during
vigilance tasks is that it usually eliminates the main phenomenon of
interest, the vigilance decrement. It may, however, be presented at the end
of  a  vigilance  session  to  provide  the  subject  with  a  performance  summary. 
Tasks  providing  large  amounts  of  intrinsic  feedback,  such  as  some  tracking 
tasks  or  risk-taking  tasks,  arguably  do  not  need  KR  for  performance  informa¬ 
tion.  Nevertheless,  KR  may  be  provided  for  these  types  of  tasks  to  maintain 
the  subject's  motivation. 

Therefore,  for  most  experiments,  the  investigator  must  decide  between 
concurrent  (presented  during  the  performance  of  the  task)  and  terminal 
(presented at the end of a trial or a session) KR and must determine the
accuracy  (precision)  of  the  KR.  Considerations  pertinent  to  both  these 
decisions  are  described  below. 

Concurrent  Versus  Terminal  KR 


Normally, deciding between concurrent and terminal KR is easy; except in
multiple-task experiments, human performance research uses almost
exclusively terminal KR. In some laboratory experiments, terminal KR may be
given after every response of a discrete task to provide the subject with
immediate performance information. Such a presentation schedule is rarely
used in applied contexts because it requires too much time and may prevent
the subject from developing a response strategy or a response rhythm.
Instead, investigators in applied situations tend to present KR after a
trial (which may be defined either as a fixed time period or a fixed number
of responses), after a block of trials, or after a session. Sometimes, good
reasons preclude providing any KR to the subject for a given test.

Footnote to the earlier statement that KR is almost always provided in
laboratory research: One notable exception is studies of exotic
environments. Usually, KR is not provided during the exotic environment
because subjects may be able to develop strategies to compensate for the
environment. The subjects, however, are still usually trained with KR.

As  noted  above,  concurrent  KR  is  used  almost  exclusively  in  multiple- 
task  experiments  to  control  the  priorities  that  subjects  assign  to  the 
tasks.  Few  techniques  for  presenting  concurrent  KR  have  been  developed,  and 
all  of  these  have  serious  drawbacks. 

Gopher  and  North  (19)  developed  one  of  the  few  intermittent  concurrent 
KR  techniques.  If  the  subject's  performance  dropped  below  a  certain  level 
on  one  of  the  tasks,  a  brief  tone  sounded.  The  subject  then  attempted  to 
improve  performance  on  the  associated  task.  Unfortunately,  no  feedback  was 
provided  to  the  subject  indicating  that  performance  on  the  task  had  once 
again reached an acceptable level. Thus, although this technique appears to
be straightforward, subjects were frequently confused about the
acceptability of their immediate level of performance.

This  intermittent  technique  was  superseded  by  the  moving  bars  technique, 
a  more  complicated  method  for  presenting  concurrent  KR  (e.g.,  references  20- 
23).  This  method  displays  one  bar  graph  and  one  desired  performance  line 
for  each  task.  The  height  of  the  bar  graph  changes  during  a  trial;  its 
height  reflects  the  subject's  average  performance  calculated  over  some 
period  of  time,  typically  5  or  10  s.  The  taller  the  bar  graph,  the  better 
the  subject's  performance  on  that  task.  The  subject  is  usually  instructed 
to  perform  so  that  the  moving  bar  graphs  reach  or  exceed  the  desired  perfor¬ 
mance  lines  for  their  respective  tasks.  The  experimenter  can  adjust  the 
height  of  the  desired  performance  lines  to  any  level  to  control  the  relative 
priorities  of  the  two  tasks. 
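
A minimal sketch of the averaging step for one bar is given below. It
assumes that performance is sampled as a score between 0 and 1 at known
times; the class name, the scoring scale, and the use of a simple unweighted
mean over the window are all assumptions made for illustration.

```python
from collections import deque


class MovingBar:
    """Track one task's bar height as a moving average over a time window."""

    def __init__(self, window_s, desired_level):
        self.window_s = window_s            # averaging period, e.g., 5 or 10 s
        self.desired_level = desired_level  # height of the desired performance line
        self._samples = deque()             # (timestamp_s, score) pairs, oldest first

    def add_sample(self, timestamp_s, score):
        self._samples.append((timestamp_s, score))
        # Discard samples that have fallen out of the averaging window.
        cutoff = timestamp_s - self.window_s
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()

    def height(self):
        if not self._samples:
            return 0.0
        return sum(score for _, score in self._samples) / len(self._samples)

    def meets_target(self):
        return self.height() >= self.desired_level


bar = MovingBar(window_s=5.0, desired_level=0.8)
for t, score in [(0.0, 0.5), (2.0, 0.7), (4.0, 0.9), (6.0, 1.0)]:
    bar.add_sample(t, score)
print(bar.height(), bar.meets_target())
```

One such object would be kept for each task, with the desired level set
separately for each to express the intended relative priorities.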

Although  this  technique  sounds  impressive,  it  also  has  several  draw¬ 
backs.  The  most  obvious  is  that  it  requires  a  considerable  amount  of  the 
processing  capacity  of  the  computer  to  calculate  and  adjust  the  height  of 
the  bar  graphs.  The  resolution  of  the  graphics  system  also  must  be  suf¬ 
ficient  to  portray  smooth  movement  rather  than  discrete  jumps  in  bar  graph 
height.  Another  problem  is  that  the  presence  of  the  bar  graphs  may  act  as  a 
third task or a distraction, depressing performance on the two tasks of
interest.  Additionally,  subjects  may  be  more  inclined  to  regard  the  exper¬ 
iment  as  a  game  when  the  bars  are  present;  some  subjects  appear  to  be  much 
more  interested  in  manipulating  the  height  of  the  bar  graphs  than  performing 
as  instructed.  Finally,  the  investigator  must  develop  an  algorithm  or  at 
least  a  rationale  for  calculating  the  momentary  height  of  the  moving  bars 
and  for  setting  the  value  of  the  desired  performance  lines.  No  guidelines 
exist  for  establishing  these  values.  Determining  the  values  of  the  various 
parameters  is  a  time-consuming  process,  and  the  investigator  should  allow  an 
adequate  amount  of  pretest  time  to  experiment  with  the  display. 

Another method for presenting concurrent KR was developed by S. Harris
at NAMRL (24). To use this method, each trial must be divided into two
parts.  Performance  data  on  each  task  are  collected  during  the  first  part  of 
the  trial.  The  trial  is  stopped  at  the  end  of  the  first  part,  and  the  data 
are  then  analyzed  according  to  an  algorithm  determined  by  the  investigator. 
The  results  of  the  analyses  are  displayed  to  the  subject  using  a  circle  with 
one pointer. If the pointer points towards the 12 o'clock position, the
subject  has  assigned  the  correct  priorities  to  the  two  tasks  (or  is  distri¬ 
buting  attention  as  intended).  If  the  pointer  is  displaced  to  the  left  of 
the  12  o'clock  position,  then  the  subject  is  favoring  the  left-hand  task  by 
an  amount  proportional  to  the  displacement  of  the  pointer  from  vertical. 
Similarly,  if  the  pointer  is  displaced  to  the  right  of  vertical,  then  the 
subject has been favoring the right-hand task. After the subject has seen
the  KR  display  for  a  short  time,  it  is  erased,  the  trial  resumes,  and  the 
subject  changes  his  performance  to  correct  for  any  displacements  of  the 
pointer  from  the  vertical  position. 
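
Because the algorithm for this display is left to the investigator, the
following is only one hypothetical possibility. It assumes that performance
on each task has been normalized to a 0-1 score and that the pointer angle
is a linear function of the difference between the two scores; the function
name, the gain, and the clipping limit are illustrative assumptions.

```python
def pointer_angle(left_score, right_score, gain_deg=90.0, max_deg=90.0):
    """Return the pointer displacement from vertical in degrees.

    0 means 12 o'clock (priorities as intended); negative values displace the
    pointer to the left (left-hand task favored); positive values displace it
    to the right (right-hand task favored).
    """
    imbalance = right_score - left_score
    angle = gain_deg * imbalance
    # Keep the pointer within the upper half of the circle.
    return max(-max_deg, min(max_deg, angle))


print(pointer_angle(0.8, 0.8))   # balanced performance
print(pointer_angle(0.9, 0.5))   # left-hand task favored
```

If the intended priorities are unequal, the scores could be weighted before
the difference is taken; that choice is also left to the investigator.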

Again,  this  method  has  all  of  the  drawbacks  of  the  moving  bars  technique 
and  one  additional  drawback:  The  trial  is  actually  stopped  during  the 
presentation  of  KR.  Although  this  prevents  the  KR  display  from  distracting 
the  subject,  the  subject  must  re-establish  any  cognitive  or  response  strate¬ 
gies  during  the  second  part  of  the  trial. 

None  of  the  techniques  for  presenting  concurrent  KR  is  completely  satis¬ 
factory.  Research  on  these  techniques  appears  to  have  been  abandoned,  at 
least  temporarily,  because  few  investigators  believe  that  such  techniques 
are absolutely necessary to control the priorities that the subjects assign
to the tasks.

Precision  of  KR 

If  KR  is  used,  the  investigator  must  decide  how  precise  the  information 
given to the subject should be. A clear distinction should be made between
inaccurate  KR  and  imprecise  KR.  Inaccurate  KR  refers  to  KR  that  is  mislead¬ 
ing,  that  is,  deceitful.  In  most  cases,  investigators  cannot  use  inaccurate 
KR  unless  its  use  has  been  approved  by  the  responsible  human  subjects  com¬ 
mittee.  Inaccurate  KR  is  used  very  rarely  in  human  information  processing 
research; its effects are often motivational and of little immediate
interest. Imprecise KR is simply KR that is reported less precisely than
the data are recorded. For example, an investigator may record reaction
times to the nearest millisecond but present reaction time KR to the
nearest tenth of a second. In this example, the KR is imprecise but not
misleading. No guidelines are available concerning the precision of the KR
presented to the subject. The
author's  impression  is  that,  for  simplicity,  most  investigators  present  KR 
that  has  the  same  degree  of  precision  as  the  data. 
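
The distinction can be illustrated with a short sketch in which the
recorded value keeps its full precision and only the displayed KR is
rounded. The function name and the output format are assumptions.

```python
def format_rt_feedback(rt_ms, decimals=1):
    """Format a reaction time recorded in milliseconds as KR text in seconds.

    The stored value is not changed; only the displayed precision is reduced.
    """
    return f"{rt_ms / 1000:.{decimals}f} s"


recorded_rt_ms = 437                             # stored to the nearest millisecond
print(format_rt_feedback(recorded_rt_ms))        # imprecise KR
print(format_rt_feedback(recorded_rt_ms, 3))     # KR as precise as the data
```
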


4. IMPLEMENTATION PROBLEMS


This  chapter  deals  with  some  of  the  more  common  problems  that  investi¬ 
gators  encounter  when  they  implement  existing  tests  on  their  own  equipment. 
The reader may wish to consult Behavior Research Methods, Instruments,
& Computers for current information on other pertinent hardware and
software  problems.  This  journal  publishes  articles  on  instrumentation, 
testing,  computer  technology,  and  algorithms.  Most  of  the  articles  are 
concerned  with  microcomputers  and  many  deal  with  problems  that  could  occur 
in  the  development  of  a  performance  battery.  The  reader  may  also  wish  to 
consult  Moerland  et  al.  (25)  for  a  discussion  of  the  effects  of  computer¬ 
izing  standard  tests  on  test-retest  reliability,  validity,  and  administra¬ 
tion  time.  All  of  the  problems  discussed  in  this  chapter  should  be  elimi¬ 
nated  when  a  standardized  performance  assessment  battery  implemented  on 
standardized equipment becomes available. These problems are grouped into
three  major  categories:  hardware,  software,  and  subject  instructions. 

HARDWARE  PROBLEMS 

Most  human  performance  tests  are  developed  in  university  or  specialized 
government  laboratories.  Typically,  these  laboratories  use  equipment  that 
was  developed  specifically  to  assess  human  performance.  Investigators  work¬ 
ing  in  more  applied  settings  often  do  not  have  access  to  comparable  pieces 
of  equipment.  In  most  cases,  substituting  general  apparatus  for  specialized 
apparatus  has  no  effect.  However,  equipment  substitution  can  have  serious 
consequences  for  human  performance  testing  in  at  least  two  instances. 

The  first  of  these  concerns  the  keypads.  Most  keypads  used  in  univer¬ 
sity  laboratories  are  specially  manufactured  for  research  purposes  or  are 
selected  according  to  very  strict  criteria  from  commercially  available 
products.  In  many  applied  situations,  the  investigator  may  be  forced  to  use 
the  standard  QWERTY  keyboard  or  a  keypad  that  comes  with  the  computer. 

These  devices  typically  have  several  problems.  The  keys  are  often  rela¬ 
tively  slow  to  respond  and  may  stick  at  the  contact  point,  causing  very 
distorted  reaction  times  (this  sticking  cannot  always  be  detected  simply  by 
pressing  the  keys  a  few  times).  The  keys  may  also  "bounce."  Bouncing 
occurs  when  the  computer  reads  several  responses  rather  than  one  because  of 
the  tendency  of  the  contact  points  to  deform  repeatedly  after  a  normal  key 
press.  This  problem  can  be  corrected  in  software,  or  the  investigator  can 
purchase  keypads  that  register  responses  by  changes  in  a  magnetic  field. 
Bouncing  may  also  occur  if  the  subject  accidentally  depresses  a  key  for  a 
few hundred milliseconds; the program may read several responses and store
incorrect  data.  Finally,  the  QWERTY  keyboards  and  keypads  sold  with  most 
microcomputers  are  not  well  designed  from  a  biomechanical  standpoint.  This 
may  result  in  spuriously  long  reaction  times  from  certain  keys. 
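
The software correction for bouncing is commonly implemented with a lockout
interval. The sketch below is illustrative only: the event format, the
function name, and the 20-ms default are assumptions, and a suitable
interval would have to be determined for the actual keypad.

```python
def debounce(events, lockout_ms=20):
    """Remove repeated readings of the same key caused by contact bounce.

    events: list of (timestamp_ms, key) tuples sorted by time.
    A reading is accepted only if at least lockout_ms has passed since the
    last accepted reading of the same key.
    """
    last_accepted = {}   # key -> timestamp of its last accepted reading
    accepted = []
    for timestamp_ms, key in events:
        previous = last_accepted.get(key)
        if previous is None or timestamp_ms - previous >= lockout_ms:
            accepted.append((timestamp_ms, key))
            last_accepted[key] = timestamp_ms
    return accepted


raw = [(100, "a"), (103, "a"), (109, "a"), (400, "b"), (402, "b"), (900, "a")]
print(debounce(raw))
```

A lockout of this kind handles brief bounce only; repeats produced by a key
held down for a few hundred milliseconds would need a longer interval or a
check for key release.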

A  second  problem  concerns  the  graphic  systems.  Most  microcomputer 
displays  have  relatively  poor  resolution.  Using  tests  that  require  smooth 
motion  in  either  two-  or  three-dimensional  space  is  almost  out  of  the  ques¬ 
tion  with  most  of  these  systems.  Indeed,  sometimes  the  resolution  of  these 
displays  is  so  poor  that  even  relatively  simple  two-dimensional  figures 
cannot  be  drawn  accurately.  This  becomes  a  serious  problem  for  many  of  the 
best  spatial  tests,  such  as  the  rotated  letters  (figures)  test.  A  recent 
version  of  the  NAMRL  aircrew  selection  battery  had  to  use  letters  rather 
than  figures  for  exactly  this  reason.  Color  is  another  display  problem. 
A few human performance tests require color discrimination. If only
monochromatic display systems are available, then the test either has to be
discarded or modified to allow a different type of discrimination. Any
modifications to the test must be thoroughly tested and validated before the
modified  version  can  be  used. 

SOFTWARE  PROBLEMS 

Three  persistent  problems  may  occur  in  developing  software  for  human 
performance  tests.  The  most  dangerous  problem  stems  from  the  small  changes 
the  programmer  makes  to  accommodate  hardware  limitations;  very  small  and  ap¬ 
parently  innocuous  changes  in  stimulus  presentation  or  timing  can  radically 
alter  the  nature  of  the  test.  The  most  common  change  of  this  type  concerns 
stimulus  presentation;  programmers  often  change  from  simultaneous  to  se¬ 
quential  stimulus  presentation  to  accommodate  limitations  in  the  graphics 
system.  The  introduction  of  almost  any  delay  between  the  presentation  of 
two  stimuli  will  require  the  use  of  one  or  more  memory  systems.  If  these 
systems  were  not  required  in  the  original  version  of  the  test,  the  new  and 
the  original  versions  may  have  very  different  characteristics.  Frick  (26) 
presents  a  good  example  of  the  processing  changes  that  occur  when  stimuli 
are  presented  sequentially  rather  than  simultaneously. 

Another  "small"  change  that  occurs  frequently  is  the  size  of  the  stimu¬ 
li.  Programmers  may  inadvertently  change  the  size  of  the  stimuli  when 
modifying  the  software  for  a  new  display.  In  some  cases,  such  changes  will 
have  little,  if  any,  detectable  effect  on  the  subject's  performance.  In 
other  cases,  however,  such  changes  may  have  a  noticeable  effect,  particu¬ 
larly  if  the  stimuli  are  accidentally  reduced  in  size.  For  example,  many 
experimental  variables--such  as  fatigue,  drugs,  and  ambient  illumination-- 
may  reduce  the  subject's  visual  acuity.  The  subject  then  might  have  dif¬ 
ficulty  perceiving  a  stimulus  that  was  accidentally  reduced  in  size  but  not 
one  that  was  the  correct  size.  Such  perceptual  difficulties  might  result  in 
a  variety  of  unanticipated  (and  unwanted)  statistically  significant  perform¬ 
ance  effects. 

The  second  problem  is  related  to  speed.  Most  human  performance  tests 
are  written  in  compiled  languages  and  many  are  written  predominantly  in 
assembly  language.  Programmers  writing  in  a  noncompiled  language  should 
ensure  that  no  response-stimulus  delays  have  been  introduced  in  the  program. 
Stimulus  presentation  also  must  be  checked  to  ensure  that  the  stimuli  are 
presented  in  the  same  fashion  as  in  the  original  version.  The  most  common 
problem with microcomputers is that, because of the limitations of their
graphics  systems,  stimuli  are  sometimes  drawn  rather  than  flashed  on  the 
screen.  Drawing  the  stimuli  allows  some  information  to  be  analyzed  immedi¬ 
ately  and  can  change  the  cognitive  processes  required  by  the  test. 

The  third  problem  concerns  the  manner  in  which  the  subject's  response  is 
detected.  Either  interrupt-driven  or  software  timing  loops  can  be  used  to 
detect  a  response.  When  a  program  is  interrupt-driven,  the  program  stops  at 
some  point  until  the  subject  makes  a  response.  A  signal  is  then  sent  from 
the  response  device  to  the  computer,  indicating  that  a  response  has  occur¬ 
red.  The  program  then  processes  the  response  and  performs  other  functions 
until  it  again  stops  to  wait  for  another  response  from  the  subject. 


Many  investigators  believe  that  interrupt-driven  software  provides  the 
most  accurate  measurement  of  reaction  times.  This,  however,  is  not  true; 
most  of  the  variability  in  reaction  time  measurement  occurs  after  a  response 
has  been  detected  and  processed.  Generally,  the  majority  of  the  error  of 
measurement  is  caused  by  variability  in  the  time  required  to  present  the 
next  stimulus  to  the  subject. 

Interrupt-driven software has two problems. First, because the program
waits for the subject to respond to continue processing, fixed-length trials
are impossible to obtain. Typically, after the program detects a response,
it checks the clock to determine if the trial duration has been exceeded.
If it has, the trial is stopped at this point. If not, the program finishes
the  remaining  functions  and  again  waits  for  the  subject  to  make  a  response. 
Thus,  the  trial  can  be  stopped  only  after  the  subject  makes  a  response.  If 
the  subject  does  not  respond  for  some  reason,  the  trial  will  go  on  indefi¬ 
nitely.  Second,  interrupt-driven  software  typically  requires  special  re¬ 
sponse  devices  that  signal  the  computer  when  a  response  has  occurred.  Many 
common  laboratory  keypads  and  keyboards  cannot  be  used  as  interrupt-driven 
devices.

The second type of reaction time measurement uses software timing loops.
Generally,  a  program  using  this  technique  performs  a  number  of  initial 
functions,  presents  a  stimulus  to  the  subject,  and  then  enters  a  software 
timing  loop.  This  loop  may  contain  any  number  of  statements,  but  one  must 
be  a  command  to  check  the  response  device(s).  If  no  response  is  detected, 
the  program  continues  executing  this  loop.  As  soon  as  a  response  is  detec¬ 
ted,  the  program  typically  exits  the  loop  and  reads  the  system  clock  to 
record  the  time  when  the  response  occurred. 

One major advantage of using software loops to measure reaction times is
that this technique can be used to create trials of specified durations.
This is done by inserting a statement in the timing loop to read the system
clock  and  compare  it  to  the  specified  trial  length.  If  the  elapsed  time 
exceeds  the  trial  length,  the  trial  is  terminated  without  a  response  from 
the  subject.  A  second  advantage  of  this  technique  is  that  no  special  re¬ 
sponse  devices  are  required.  The  only  minor  drawback  of  software  timing 
loops  is  that  they  result  in  slightly  more  variability  in  reaction  time 
measurement  than  an  interrupt-driven  approach.  This  occurs  because  most 
timing  loops  contain  a  number  of  statements.  Because  a  response  can  occur 
during  the  execution  of  any  statement  in  the  loop,  the  number  of  statements 
to  be  executed  before  a  response  is  detected  varies.  The  amount  of  variance 
that  occurs  in  measuring  reaction  times  using  this  technique  depends  on  the 
number  of  statements  in  the  loop,  but  the  time  required  to  execute  each 
statement  is  normally  so  small  that  this  source  of  variance  is  trivial 
compared  to  the  variance  associated  with  the  presentation  of  the  stimulus. 
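
The timing loop described above might be sketched as follows. This is an
illustration under stated assumptions, not a validated implementation:
poll_response is a hypothetical function that returns a key code or None,
and Python's time.perf_counter stands in for the system clock.

```python
import time


def timed_response_loop(poll_response, trial_length_s, clock=time.perf_counter):
    """Poll for a response until one occurs or the trial length is exceeded.

    poll_response: callable returning a key code, or None if no key is down
                   (hypothetical; it would wrap the actual response device).
    Returns (key, elapsed_s); key is None if the trial ended without a response.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        # Statement that enforces the fixed trial length.
        if elapsed >= trial_length_s:
            return None, elapsed
        # Statement that checks the response device.
        key = poll_response()
        if key is not None:
            # Read the clock as soon as the response is detected.
            return key, clock() - start


# Example with a simulated device that reports a key on the third poll.
simulated = iter([None, None, "x"])
print(timed_response_loop(lambda: next(simulated), trial_length_s=2.0))
```

The fewer statements the loop contains, the smaller the uncertainty about
when within an iteration the response actually occurred.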

The  interrupt-driven  technique  may  be  combined  with  the  timing  loop 
technique  by  placing  interrupt-driven  statements  in  the  timing  loop.  This 
hybrid  technique  allows  fixed-length  trials  but  requires  the  same  special 
hardware  needed  by  the  normal  interrupt-driven  software.  On  the  whole,  the 
best approach for most human performance research is to measure reaction
times  using  timing  loops  or  a  combination  of  timing  loops  and  interrupt- 
driven  software  rather  than  using  only  interrupt-driven  software. 
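
One way to picture the hybrid approach in a modern language is a blocking
wait with a timeout. In this sketch, which is an assumption rather than the
technique used in any particular battery, a queue.Queue stands in for the
interrupt-driven response device: some other thread or driver (not shown)
is assumed to place a key code on the queue when a response occurs.

```python
import queue
import time


def wait_for_response(response_queue, trial_length_s):
    """Wait for a signaled response, but never longer than the trial length.

    response_queue: queue.Queue filled by a (hypothetical) device handler.
    Returns (key, elapsed_s); key is None if the trial timed out.
    """
    start = time.perf_counter()
    try:
        # Blocks until the device signals a response or the trial length passes.
        key = response_queue.get(timeout=trial_length_s)
    except queue.Empty:
        return None, time.perf_counter() - start
    return key, time.perf_counter() - start


device = queue.Queue()
device.put("x")                      # simulate a response already signaled
print(wait_for_response(device, trial_length_s=0.5))
```

The timeout gives the fixed-length trial, while the signal from the device
removes the need to poll; the special response hardware is still required.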


SUBJECT  INSTRUCTIONS 


Developing  instructions  for  a  computerized  test  is  frequently  a  time- 
consuming  process.  Typically,  the  instructions  for  most  human  performance 
tests  are  designed  for  one-on-one  interactions.  That  is,  the  experimenter 
reads  or  plays  a  tape  of  the  instructions  to  the  subject  and  allows  the 
subject to ask questions. This procedure is not always practical in applied
settings in which many subjects are tested concurrently and the experimenter
cannot move from subject to subject to answer questions. The goal then in
applied settings is to deliver clear instructions automatically to the
subject. Written instructions generally are used rather than taped
instructions in applied settings because it is easier for the subjects to
reread passages they do not understand than to replay a tape.

Written  instructions  are  not  easy  to  develop.  Using  simple  language  and 
including  examples  of  stimuli  and  responses  either  on  the  display  screen  or 
on  loose  sheets  of  paper  placed  near  the  computer  does  help.  If  a  test 
seems  particularly  difficult  for  subjects  to  understand,  a  short  pretest  can 
be  administered.  If  the  subject  does  not  score  above  a  predetermined  cri¬ 
terion,  then  the  experimenter  can  be  notified  to  provide  additional  help. 

The  use  of  a  computerized  test  does  not  diminish  the  need  for  stand¬ 
ardized  procedures  for  interacting  with  subjects.  This  is  particularly  true 
when  more  than  one  individual  will  have  contact  with  the  subjects  in  a  given 
context,  that  is,  there  is  more  than  one  experimenter.  Standardized  proce¬ 
dures  for  obtaining  informed  consent,  introducing  subjects  to  the  testing 
area,  and  answering  questions  should  be  developed  before  any  data  are  col¬ 
lected.  These  procedures  should  be  strictly  followed  to  minimize  any 
experimenter-induced biases.


5. CLASSIFICATION TECHNIQUES FOR INFORMATION PROCESSING TESTS

After  deciding  on  a  broad-  or  a  narrow-spectrum  battery  and  addressing 
some of the methodological issues, the investigator needs to select specific
tests  for  the  battery.  Currently,  tests  are  classified  using  a  variety  of 
different  schemes.  The  oldest  scheme  classifies  tests  according  to  what  the 
subject  is  required  to  do.  Thus,  there  are  tracking  tasks,  vigilance  tasks, 
choice  reaction  time  tasks,  psychomotor  tasks,  et  cetera.  This  scheme  is 
used  today  only  for  very  well  known  tasks,  such  as  tracking  tasks  or  vigil¬ 
ance  tasks,  because  it  does  not  describe  the  tasks  in  the  detail  required  by 
modern  cognitive  psychology. 

A second relatively new scheme, which is based on Wickens' Multiple
Resources Model (24), uses a number of different dimensions to describe a
test: code of processing (verbal versus spatial), stage of processing
(perceptual  and  central  versus  response),  stimulus  mode  (visual  versus 
auditory),  and  response  mode  (manual  versus  vocal).  This  scheme  is  used 
most  often  to  describe  test  combinations.  Because  it  is  a  relatively  new 
scheme,  it  is  not  yet  commonly  used  in  applied  research.  Additionally,  it 
has  not  been  widely  accepted  in  cognitive  psychology. 
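
As a sketch of how such a description might be recorded, the four
dimensions can be held in a small data structure. The class and function
names are hypothetical, and the two example profiles assume one particular
implementation of each test (visual stimuli, with manual or vocal
responses chosen arbitrarily for illustration).

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskProfile:
    name: str
    code: str            # "verbal" or "spatial"
    stage: str           # "perceptual/central" or "response"
    stimulus_mode: str   # "visual" or "auditory"
    response_mode: str   # "manual" or "vocal"


def shared_dimensions(a, b):
    """List the dimensions on which two tests have the same value."""
    return [
        f.name
        for f in fields(a)
        if f.name != "name" and getattr(a, f.name) == getattr(b, f.name)
    ]


rotation = TaskProfile("mental rotation", "spatial", "perceptual/central",
                       "visual", "manual")
search = TaskProfile("memory search", "verbal", "perceptual/central",
                     "visual", "vocal")
print(shared_dimensions(rotation, search))
```

Listing the shared dimensions is one simple way to describe a test
combination in the terms of this scheme.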

A  third  scheme  identifies  tests  according  to  the  primary  cognitive 
structure (e.g., short-term memory) or process (e.g., memory retrieval) they
purport  to  measure.  This  scheme  is  based  loosely  on  cognitive  psychology 
and  appears  to  be  the  most  widely  accepted  classification  scheme  for  applied 
research  at  present. 

In  the  following  chapters,  tests  are  classified  using  both  the  second 
and  third  schemes  described  above  to  the  extent  possible. 


6.  RATE-OF-INFORMATION-PROCESSING  TESTS 


OVERVIEW 

The  rate-of-information-processing  tasks  are  among  the  most  theoreti¬ 
cally  important  and  widely  used  tests  available  today.  This  category  in¬ 
cludes  the  Sternberg  memory  search  task,  the  Neisser  visual  scan  task,  the 
mental  rotation  task,  and  the  choice  reaction  time  task.  These  four  rate- 
of-information-processing  tasks  defy  easy  classification  using  the  two 
schemes  described  earlier.  Using  Wickens'  Multiple  Resources  Model,  all 
four  of  these  tasks  require  predominantly  early  rather  than  late  processing. 
Responses  to  any  of  the  four  tasks  may  be  made  either  verbally  or  manually 
and,  except  for  the  mental  rotation  task,  stimuli  may  be  presented  either 
visually  or  auditorily.  The  tasks  require  verbal  processing  code  resources 
except  for  the  mental  rotation  task,  which  requires  spatial  processing  code 
resources.  Using  the  third  scheme,  the  Sternberg  and  Neisser  tasks  require 
some  memory  functions,  the  mental  rotation  task  may  or  may  not  depending  on 
its  implementation,  and  the  choice  reaction  time  task  requires  very  minimal 
memory  functions.  The  mental  rotation  task  is  usually  assumed  to  require 
spatial  processing;  the  other  three  are  assumed  to  require  verbal  proces¬ 
sing. 

TASK  DEVELOPMENT 

Because  all  four  of  these  tasks  are  described  in  detail  in  the  litera¬ 
ture,  no  specific  development  information  will  be  given.  Instead,  some 
background  is  provided  for  each  task. 

Sternberg  Memory  Search 

The  Sternberg  task  (3,4)  probably  is  the  most  thoroughly  documented 
cognitive  test  in  existence  today.  Extensive  baseline  data  exist,  and 
standard  values  have  been  established  for  its  parameters.  Additionally,  the 
task  is  sensitive  to  the  effects  caused  by  some  toxic  substances,  such  as 
lead  (27). 

Neisser  Visual  Search 

This  test  (28)  was  developed  using  the  same  approach  and  concepts  as  the 
Sternberg  task.  It  has  been  used  much  less  extensively  in  both  basic  and 
applied  research  than  the  Sternberg  task.  No  standardized  version  of  this 
test  exists.  Consequently,  no  baseline  data  are  available. 

Mental  Rotation 


The  mental  rotation  task  is  a  relatively  new  cognitive  test  that  was 
developed  and  popularized  by  Shepard  and  Cooper  (a  good  overview  of  this 
work  is  given  in  reference  29;  see  also  30  and  31).  Like  the  Sternberg 
task,  this  is  a  theoretically  well  developed  test  that  is  supported  by  a 
comprehensive  and  thorough  body  of  literature.  Unlike  the  Sternberg  task, 
however,  no  standard  values  of  its  parameters  are  available  because  the 
rates  of  rotation  obtained  in  the  experiments  are  strongly  affected  by  the 
familiarity  of  the  object  (i.e.,  letters  versus  geometrical  shapes)  and  the 
type  of  rotation  required  (two-  or  three-dimensional).  Very  little  is  known 
about  the  general  robustness  of  this  task.  Cooper  and  Shepard  (31)  maintain 
that  there  are  large  and  consistent  individual  differences  in  rotation  rates 
although  this  assertion  has  not  been  tested  with  large  populations.  Preliminary 
testing  conducted  at  NAMRL  indicates  that  the  use  of  low  resolution 
displays  may  seriously  degrade  performance.  Additionally,  instructions  for 
this  test  seem  to  be  particularly  difficult  to  develop. 

Choice  Reaction  Time 


The  choice  reaction  time  task  is  one  of  the  oldest  tests  in  psychology 
and  has  been  studied  for  well  over  100  years  (32).  This  task  appears  to  be 
a  great  deal  more  robust  than  the  other  three  tasks  described  in  this 
section.  That  is,  the  basic  linearity  of  the  function  relating  correct 
reaction  time  to  the  amount  of  information  transmitted  is  affected  less  by 
methodological  variations  than  comparable  functions  for  the  other  three 
tasks.  The  major  drawback  to  the  task  is  that  no  real  baseline  data  are 
available;  correct  reaction  times  from  this  task  are  affected  by  the  stimulus 
and  response  modalities,  the  stimulus  domain,  the  configuration  of  the 
response  device  for  manual  responses  (described  below),  and  practice.  Thus, 
even  though  the  general  effect  of  many  experimental  variables  on  the  reac¬ 
tion  time  function  is  known,  observed  correct  reaction  times  cannot  be 
predicted  very  accurately. 

DEPENDENT  MEASURES 

One  major  characteristic  of  the  four  tasks  described  above  is  that  they 
all  measure  rate  of  information  processing.  Thus,  for  all  four  of  these 
tasks,  a  linear  regression  is  calculated  using  the  raw  correct  reaction  time 
scores  from  various  conditions  (degrees  of  stimulus  rotation  for  the  mental 
rotation  task  or  set  size  for  the  other  three  tasks).  The  major  dependent 
variable  for  all  of  these  tasks  is  the  slope  of  the  regression  equation 
although  the  intercept  is  also  of  both  practical  and  theoretical  importance. 
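
The regression described above can be sketched in a few lines. This is a minimal illustration, not a prescribed analysis: the function name `fit_line` and the reaction time values are hypothetical, and the fit is ordinary least squares on the mean correct reaction time per condition.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical mean correct reaction times (ms) for memory set sizes 2 to 5.
set_sizes = [2, 3, 4, 5]
mean_rt_ms = [476.0, 514.0, 552.0, 590.0]

slope, intercept = fit_line(set_sizes, mean_rt_ms)
print(f"slope = {slope:.1f} ms per item, intercept = {intercept:.1f} ms")
```

For the mental rotation task, the same function would be applied with degrees of rotation in place of set size, so the slope is expressed in milliseconds per degree.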

PRACTICAL  PROBLEMS 

A  number  of  practical  problems  occur  with  these  tasks  because  slope  is 
the  dependent  variable  of  interest.  The  primary  problem  involves  practice. 
Subjects  must  receive  relatively  extensive  practice  on  these  tasks  to  pro¬ 
duce  data  that  are  fit  well  by  linear  regression.  Enough  time  may  not  be 
available  to  allow  sufficient  practice  in  an  applied  situation.  Even  if  the 
data  are  described  adequately,  the  slopes  may  not  be  stable,  and  even  more 
practice  may  be  required.  (See  reference  33  for  a  discussion  of  reliability 
problems  associated  with  the  use  of  slopes.) 

A  second  practical  problem  concerns  the  set  sizes  that  can  be  investi¬ 
gated  for  the  Sternberg,  Neisser,  and  choice  reaction  time  tasks.  Because 
human  short-term  memory  is  limited,  the  number  of  items  to  be  held  in  memory 
in  the  Sternberg  and  Neisser  tasks  is  usually  limited  to  six.  However,  some 
normal  adults  cannot  retain  six  items,  and  occasionally  the  standard  devia¬ 
tion  of  correct  reaction  times  at  the  six-item  level  is  much  larger  than  at 
the  other  levels.  This  can  cause  problems  with  subsequent  statistical 
analyses.  Therefore,  the  investigator  may  want  to  limit  the  maximum  number 
of  items  to  be  held  in  memory  to  five. 

The  comparable  problem  for  the  choice  reaction  time  task  is  slightly 
different.  Because  all  of  the  items  may  not  be  held  in  short-term  memory  at 
the  same  time,  this  task  does  not  appear  to  be  as  affected  by  memory  limita¬ 
tions  as  the  Sternberg  and  Neisser  tasks.  The  major  limitation  for  this 
task  concerns  the  method  of  making  a  response.  If  the  investigator  wants 
the  subject  to  respond  manually  to  the  stimuli,  the  number  of  distinct 
responses  is  limited  to  10.  Normally,  this  implies  that  the  maximum  number 
of  stimuli  will  also  be  limited  to  10  unless  the  investigator  wants  to 
examine  many-to-one  mappings,  which  involve  other  considerations.  On  the 
other  hand,  if  the  investigator  can  allow  the  subject  to  respond  vocally, 
the  number  of  responses  is  theoretically  unlimited. 

Manual  responses  for  choice  reaction  time  tasks  also  involve  two  other 
related  problems.  If  the  task  requires  more  than  four  different  responses, 
the  response  device  must  be  configured  either  to  allow  movement  of  the 
finger  of  the  dominant  hand  or  use  of  the  nondominant  hand.  The  nondominant 
hand  normally  produces  reaction  times  somewhat  slower  than  the  dominant 
hand,  introducing  a  bias  into  the  data.  Similarly,  the  ring  and  little 
fingers  produce  responses  that  are  longer  than  those  produced  by  the  index 
and  second  fingers,  particularly  for  the  nondominant  hand.  Currently,  only 
one  accepted  technique  can  eliminate  this  bias:  use  both  hands  and  analyze 
only  the  data  from  the  index  fingers  of  each  hand.  The  problem  with  this 
technique  is  that  only  a  fraction  of  the  responses  emitted  by  the  subject 
are  used.  To  obtain  good  estimates  of  various  parameters,  the  subject  must 
make  many  more  responses  than  if  the  data  from  all  the  fingers  were  ana¬ 
lyzed. 

A  number  of  investigators  have  chosen  to  circumvent  the  problems  related 
to  speed  of  response  by  allowing  the  subject  to  respond  using  only  the  index 
finger  of  the  dominant  hand.  Typically,  the  subject  keeps  the  index  finger 
on  a  'home'  key  and  moves  it  to  the  response  keys.  This  introduces  a  travel 
time  (distance)  that  is  added  to  the  true  reaction  time.  If  the  response 
keys  are  all  the  same  distance  from  the  home  key,  the  travel  time  should  be 
a  constant,  and  no  bias  is  introduced.  However,  for  some  common  response 
devices,  such  as  a  4  by  4  matrix  keypad,  the  travel  time  may  not  be  constant. 
In  this  situation,  either  a  large  percentage  of  the  responses  must 
be  excluded  from  the  analyses,  or  extensive  baseline  data  must  be  collected 
to  obtain  estimates  of  the  travel  time. 
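
The geometry behind this problem can be illustrated with a short sketch. The key spacing, the position of the home key, and the names used here are assumptions chosen for illustration only; they do not describe any particular response device.

```python
import math

KEY_PITCH_MM = 19.0   # assumed center-to-center key spacing
HOME_KEY = (3, 0)     # assumed home key: bottom-left key, as (row, column)


def travel_distance_mm(key, home=HOME_KEY, pitch=KEY_PITCH_MM):
    """Straight-line distance from the home key to a response key."""
    return pitch * math.hypot(key[0] - home[0], key[1] - home[1])


# Distance from the home key to every other key on a 4 by 4 matrix keypad.
distances = {
    (row, col): travel_distance_mm((row, col))
    for row in range(4)
    for col in range(4)
    if (row, col) != HOME_KEY
}

distinct_mm = sorted({round(d, 1) for d in distances.values()})
print(f"{len(distinct_mm)} distinct travel distances: {distinct_mm}")
```

Because the distances differ from key to key, travel time cannot be treated as a constant for this layout, which is why either response exclusion or separate baseline estimates are needed.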

All  four  of  these  tests  also  have  problems  with  the  zero-choice  or  the 
zero-rotation  situation.  Usually,  the  linear  regression  equation  fitting 
the  correct  reaction  time  scores  to  the  set  size,  number  of  alternatives,  or 
degrees  of  rotation  accounts  for  a  large  percentage  of  the  variance,  typi¬ 
cally  more  than  70%.  However,  if  the  correct  reaction  time  scores  from  the 
set  size  1,  1  alternative,  or  0  degrees  of  rotation  condition  are  added  to 
the  equation,  the  percentage  of  variance  accounted  for  by  the  equation  fre¬ 
quently  drops.  Visual  inspection  of  the  data  usually  reveals  that  the 
additional  data  point  has  a  larger  mean  than  would  be  predicted  from  the 
previous  data  points.  To  date,  no  explanation  for  this  finding  has  been 
generally  accepted,  which  indicates  a  lack  of  knowledge  about  choice  or 
rotation  situations  versus  no  choice  or  no  rotation  situations.  From  a 
practical  point  of  view,  the  investigator  should  calculate  two  equations  for 
the  experimental  test.  One  equation  should  use  all  the  available  data; 
the  other  should  exclude  the  set  size  1,  1  alternative,  or  zero  degree 
rotation  condition.  The  investigator  should  use  the  equation  that  explains 
the  most  variance. 
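
The two-equation comparison recommended above might be carried out as in the following sketch. The data are hypothetical and were constructed so that the set size 1 mean lies above the line defined by the other conditions; `fit_line_r2` is an illustrative helper, not part of any battery software.

```python
def fit_line_r2(x, y):
    """Least-squares slope, intercept, and proportion of variance explained."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical mean correct reaction times (ms); the first entry is set size 1.
set_sizes = [1, 2, 3, 4, 5]
mean_rt_ms = [470.0, 476.0, 514.0, 552.0, 590.0]

fit_all = fit_line_r2(set_sizes, mean_rt_ms)
fit_trimmed = fit_line_r2(set_sizes[1:], mean_rt_ms[1:])

# Keep the equation that explains more variance, as the text recommends.
chosen = "trimmed" if fit_trimmed[2] > fit_all[2] else "all"
print(f"all data: R^2 = {fit_all[2]:.3f}; without set size 1: R^2 = {fit_trimmed[2]:.3f}")
print(f"equation used: {chosen}")
```

Note that the two fits are based on different numbers of data points, so the comparison is a practical rule of thumb rather than a formal statistical test.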

Finally,  one  major  methodological  problem  exists  for  an  investigator  who 
wishes  to  use  the  Sternberg  task.  This  problem  concerns  the  type  of  map¬ 
ping,  varied  or  constant,  to  be  used.  For  applied  research,  the  constant 
mapping  procedure  is  normally  used  because  more  data  can  be  collected  in  a 
given  period  of  time.  However,  extensive  practice  with  the  constant  mapping 
procedure  can  lead  to  'automatic  processing'  (34,35).  If  automatic  proces¬ 
sing  occurs,  the  slope  of  the  function  relating  correct  reaction  time  to  set 
size  becomes  zero,  indicating  an  infinitely  fast  rate  of  processing. 


7.  HIGHER  PROCESSES 


This  chapter  describes  tests  that  are  assumed  to  assess  more  complex 
cognitive  functions  than  those  assessed  by  the  rate-of-information-proces- 
sing  tests.  Tests  of  higher  cognitive  processes  generally  resemble  real- 
world  activities  and,  therefore,  are  of  interest  to  investigators  examining 
applied  problems.  Many  tests  of  higher  processes  are  currently  available. 
Three  of  the  most  common  are  described  below.  A  fourth  section  on  task 
combinations  is  included  because  of  the  recent  interest  in  assessing  time¬ 
sharing  skills  and  abilities. 

MENTAL  ARITHMETIC 

Overview 


Mental  arithmetic  tasks  are  probably  the  most  frequently  used  higher 
processes  tests.  Using  the  second  classification  scheme,  which  is  based  on 
Wickens'  Model,  these  tasks  require  verbal  processing  code  resources  and 
early  rather  than  late  resources.  They  may  require  either  visual  or  audi¬ 
tory  resources  and  either  manual  or  vocal  resources  depending  on  the  imple¬ 
mentation  of  the  task.  Using  the  third  classification  scheme,  these  tasks 
all  require  the  use  of  both  short-  and  long-term  memory. 

Mental  arithmetic  tasks  are  used  frequently  in  performance  batteries  for 
several  reasons.  One  reason  is  that  they  have  high  face  validity.  That  is, 
mental  arithmetic  is  required  in  many  real-world  activities.  By  including  a 
mental  arithmetic  task  in  a  battery,  the  investigator  appears  to  be  exam¬ 
ining  relevant  skills  and  abilities.  A  second  reason  for  including  mental 
arithmetic  tasks  is  that  almost  all  adult  subjects  have  the  necessary  skills 
to  perform  at  least  simple  versions  of  this  task  and  to  understand  the 
relevant  instructions  quickly.  A  third  reason  for  the  popularity  of  these 
tasks  is  their  diversity.  Stimuli  can  be  presented  either  auditorily  or 
visually  for  many  versions  of  these  tasks,  and  subjects  can  respond  manually 
or  vocally.  Additionally,  mental  arithmetic  tasks  vary  greatly  in  difficul¬ 
ty  and  complexity.  Thus,  the  investigator  has  a  wide  range  of  potential 
tasks  available. 

Another  reason  for  using  mental  arithmetic  tasks  is  that  at  least  some 
of  them  appear  to  follow  the  standard  stage  model  of  human  information 
processing  developed  by  Sternberg  (3).  See  Ashcraft  and  Battaglia  (36)  for 
a  discussion  of  a  mental  arithmetic  task  that  follows  such  a  model.  If  this 
is  the  case,  then  the  additive  factors  logic  can  be  applied  to  at  least  some 
mental  arithmetic  tasks. 

Task  Development 

Typically,  the  most  difficult  problem  facing  an  investigator  who  wants 
to  include  a  mental  arithmetic  task  in  a  performance  battery  is  selecting  a 
task  that  is  appropriate  for  the  subject  population.  To  identify  such  a 
task,  the  investigator  should  examine  the  difficulty  of  the  mental  arithmetic 
tasks  under  consideration  carefully.  One  factor  that  affects  the  difficulty 
is  the  amount  of  information  the  subject  must  remember  to  perform  the 
task.  For  example,  some  tasks  require  the  subject  to  perform  multiple 
operations  on  a  pair  of  numbers.  To  do  this,  the  subject  must  remember  the 
sequence  of  operations  to  be  performed  as  well  as  the  results  of  the  immediately 
preceding  operation.  The  more  numbers  and  operations,  the  more 
difficult  the  task,  and  the  less  appropriate  the  task  becomes  for  some 
subject  populations. 

Task  difficulty  directly  affects  the  probability  of  detecting  experi¬ 
mental  effects  by  influencing  both  the  number  of  responses  emitted  in  a 
given  period  and  their  accuracy.  Thus,  an  easy  mental  arithmetic  task,  such 
as  adding  a  constant  to  the  stimulus,  may  result  in  a  large  number  of 
responses  in  a  given  period  of  time  while  a  more  difficult  task  will  result 
in  relatively  few  responses.  If  the  investigator  needs  numerous  responses 
in  a  short  period  of  time  to  detect  an  effect,  then  an  easy  mental  arithmetic 
task  may  be  preferred.  Task  difficulty  also  affects  the  error  rate. 
If  the  task  is  too  difficult  for  the  subjects,  they  may  respond  at  the 
chance  accuracy  level.  Consequently,  any  performance  decrements  caused  by 
experimental  factors  may  be  impossible  to  detect. 

A  list  of  some  of  the  common  mental  arithmetic  tasks  is  given  below  with 
at  least  one  reference  per  task.  The  purpose  of  this  list  is  to  provide  the 
reader  with  some  idea  of  the  types  of  tasks  available. 

Addition  of  a  Constant.  This  is  the  simplest  type  of  mental  arithmetic 
task.  The  subject  is  required  to  add  a  constant  to  a  number  or  set  of 
numbers.  Because  this  task  is  so  simple,  it  is  frequently  paced  or  perform¬ 
ed  concurrently  with  another  task  (37). 

Running  Difference.  The  subject  subtracts  the  most  recent  digit  from 
the  preceding  digit  and  enters  the  difference.  As  soon  as  the  subject 
responds,  a  new  digit  is  presented,  which  the  subject  subtracts  from  the 
immediately  preceding  digit  (13,38,39).  This  task  may  also  be  performed 
as  a  running  sum  task,  in  which  the  subject  adds  the  most  recent  two 
digits  and  reports  the  sum  (40). 

Two-digit  Addition.  The  subject  is  presented  with  two  two-digit  numbers 
to  be  added.  The  subject  reports  the  sum  and  immediately  is  presented  with 
two  more  two-digit  numbers  (41). 

Complex  Operations.  These  tasks  require  the  addition  or  subtraction  of 
multiple  digit  numbers  that  may  be  displayed  either  vertically  or  horizon¬ 
tally  (see  references  42  and  43).  Tasks  requiring  multiple-digit  division 
or  multiplication  also  fall  into  this  group  although  examples  of  these  types 
of  tasks  are  less  frequent  (44). 

Criteria  Verification.  Three  one-digit  numbers  are  presented  as  a  se¬ 
quence  to  the  subject.  The  subject  must  decide  if  the  sequence  meets  one  or 
more  criteria.  For  example,  Griffiths  and  Boyce  (45)  had  subjects  determine 
if  a  sequence  met  one  of  two  criteria:  (1)  the  first  digit  was  the  largest, 
and  the  second  digit  was  the  smallest;  or  (2)  the  third  digit  was  the 
largest,  and  the  first  digit  was  the  smallest.  The  subject  made  one  re¬ 
sponse  if  the  digits  met  the  criteria  and  another  response  if  they  did  not. 

Multiple  Operations.  Many  varieties  of  tasks  use  multiple  arithmetic 
operations.  For  example,  Morgan  et  al.  (46)  required  subjects  to  add  two 
three-digit  numbers  and  then  subtract  a  third  three-digit  number  from  the 
sum.  Chiles  and  Jennings  (47)  required  the  same  sequence  of  operations 
using  two-digit  rather  than  three-digit  numbers.  As  another  example, 
Williges  (48)  had  subjects  perform  a  mental  arithmetic  task  with  four  steps. 
The  subjects  were  presented  with  two  digits  and  added  five  to  the  smaller  of 
the  two.  After  the  addition,  the  subject  compared  the  two  digits  and  doub¬ 
led  the  smaller  digit.  Next,  the  subject  subtracted  the  smaller  from  the 
larger  and  compared  the  result  with  a  criterion  value.  If  the  result  ex¬ 
ceeded  the  criterion,  the  subject  made  one  response.  If  the  result  did  not 
exceed  the  criterion,  the  subject  made  a  second  response. 
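
To make the scoring logic of two of the tasks listed above concrete, the following sketch computes the correct answers for a running difference sequence and checks a three-digit sequence against the two criteria attributed to Griffiths and Boyce (45). The function names are hypothetical. The sign convention for the running difference (preceding digit minus most recent digit, which can be negative) and the treatment of tied digits as failing the criteria are assumptions, since neither is specified in the descriptions above.

```python
def running_difference_answers(digits):
    """Correct answers: preceding digit minus the most recent digit."""
    return [prev - cur for prev, cur in zip(digits, digits[1:])]


def meets_criteria(first, second, third):
    """True if the sequence satisfies either of the two criteria.

    Criterion 1: first digit largest and second digit smallest.
    Criterion 2: third digit largest and first digit smallest.
    Strict inequalities are used, so sequences with tied digits fail
    (an assumption; tie handling is not described in the text).
    """
    criterion_1 = first > third > second
    criterion_2 = third > second > first
    return criterion_1 or criterion_2


print(running_difference_answers([7, 3, 8, 2]))
print(meets_criteria(9, 1, 5), meets_criteria(1, 5, 9), meets_criteria(5, 9, 1))
```

A stimulus generator for an actual battery would also need to control the proportion of sequences that meet the criteria, which is not addressed here.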

Dependent  Measures 

Accuracy  scores  have  traditionally  been  the  primary  performance  measures 
for  mental  arithmetic  tasks.  These  include  percentage  correct,  number  of 
correct  responses  in  a  specified  time,  and  so  forth.  More  recently,  reac¬ 
tion  times  have  become  common  measures  of  performance  although  they  are 
almost  always  used  in  conjunction  with  some  type  of  accuracy  score. 

Practical  Problems 


The  investigator  must  be  concerned  with  few  practical  problems  beyond 
those  associated  with  task  selection.  Mental  arithmetic  tasks  are  easy  to 
program  and  debug  and  can  be  presented  using  relatively  poor  quality  graphic 
systems.  Thus,  the  only  practical  problem  concerns  measuring  reaction  time. 
Many  mental  arithmetic  tasks  require  multiple  digit  responses  that  must  be 
entered  using  some  type  of  keypad.  Subjects  usually  require  some  type  of 
familiarization  with  the  keypad,  and  the  investigator  must  allow  adequate 
time  for  this  process. 

The  investigator  may  require  all  subjects  to  enter  responses  in  a  stand¬ 
ardized  fashion.  That  is,  all  subjects  may  be  required  to  place  their 
fingers  on  a  specific  row  of  keys  or  to  use  only  one  finger  to  respond.  In 
contrast,  some  investigators  feel  that  standardizing  any  aspect  of  the  data 
entry  process  reduces  the  face  validity  of  the  test  and,  consequently,  do  not 
specify  how  the  subject  should  enter  the  responses. 

VIGILANCE  TASKS 

Overview 


Vigilance  tasks  vary  widely  in  their  stimulus  mode,  response  mode,  and 
the  number  of  stimulus  sources  the  operator  must  monitor.  The  salient 
characteristic  of  all  of  these  tasks  is  that  they  require  sustained  atten¬ 
tion  for  a  relatively  long  time  period,  typically  at  least  50  min.  These 
tasks  are  difficult  to  describe  using  either  of  the  classification  schemes. 
Using  the  second  classification  scheme  based  on  Wickens'  Model,  these  tasks 
require  early  rather  than  late  processing  resources.  They  cannot  be  clas¬ 
sified  in  terms  of  the  stimulus  or  response  resource  requirements  because 
stimuli  may  be  presented  either  auditorily  or  visually,  and  responses  may  be 
made  either  manually  or  vocally.  Similarly,  they  cannot  be  described  in 
terms  of  their  code  of  processing  resources  because  they  may  require  either 
spatial  or  verbal  processes.  These  tasks  never  have  been  described  in  the 
cognitive  literature,  but  they  seem  to  require  short-  and  long-term  memory 
and  pattern  recognition  processes. 


There  are  many  reasons  to  include  a  vigilance  task  in  a  performance 
battery.  One  of  the  most  important  is  that  these  tasks  simulate  many  important 
real-world  activities.  Thus,  the  investigator  can  increase  both  the 
applicability  of  the  results  and  the  face  validity  of  the  battery  by  includ¬ 
ing  vigilance  tasks  in  the  battery.  Another  important  reason  is  that  this 
type  of  task  has  been  thoroughly  investigated;  at  this  time,  more  than  1500 
studies  have  appeared  in  the  open  literature.  Additionally,  several  excellent 
literature  reviews  have  been  published,  such  as  Craig  (49)  and  Parasuraman 
(50).  The  investigator  may  be  able  to  decrease  substantially  the 
amount  of  pretesting  required  by  using  the  available  information  to  narrow 
the  range  of  several  task  parameters,  such  as  the  intensity  of  the  stimulus 
and  the  number  of  events  per  hour. 

There  are  also  a  number  of  practical  reasons  for  including  vigilance 
tasks  in  a  battery.  These  tasks  are  easy  to  program  and  require  little 
central  processing  capacity.  The  speed  of  the  central  processing  unit  is  of 
little  concern,  and  almost  any  microcomputer  can  be  used  successfully. 
Additionally,  very  little  hardware  is  needed  for  the  subject's  responses;  in 
most  cases,  the  subject  responds  manually  by  pressing  a  key  to  indicate  a 
signal.  The  stimuli  also  can  usually  be  presented  using  very  simple  hard¬ 
ware.  For  visual  stimuli,  only  primitive  graphics  are  normally  necessary; 
for  auditory  stimuli,  a  pure  tone  generator  and  a  white  noise  generator  are 
often  sufficient  unless  the  investigator  wants  to  simulate  a  specific  real- 
world  task,  such  as  sonar  operation. 

Vigilance  tasks  also  have  two  drawbacks.  The  first,  and  the  most  ser¬ 
ious,  is  that  subjects  often  find  these  tasks  boring.  As  a  result,  they  may 
fall  asleep  during  the  testing  session,  decide  to  stop  monitoring  the  task 
for  a  while,  or  adopt  some  new  way  of  responding,  such  as  pressing  keys  with 
their  elbows  or  feet.  They  may  also  decide  that  the  task  is  too  tedious  to 
be  tolerated  and  quit  the  experiment.  If  any  of  these  situations  occurs, 
the  experimenter  may  have  to  discard  a  large  amount  of  data.  The  second 
major  drawback  concerns  training  the  subject.  Most  normal  adult  subjects 
understand  vigilance  instructions,  but  many  have  difficulty  learning  to 
detect  signals  reliably.  Subjects  may  repeat  the  training  session  several 
times  before  reaching  the  performance  criteria  necessary  to  begin  the  test¬ 
ing  session.  A  few  subjects  never  reach  the  criteria.  Thus,  the  investiga¬ 
tor  must  allow  for  lengthy  training  sessions  and  for  replacing  subjects  who 
cannot  reach  criteria. 


Task  Development 


To  develop  a  vigilance  task,  the  investigator  must  determine  the  stimu¬ 
lus  mode,  the  response  mode,  the  number  of  stimulus  sources,  and  type  of 
discrimination  required  (successive  versus  simultaneous)  by  the  task.  The 
choice  between  visual  and  auditory  stimuli  appears  to  be  completely  arbit¬ 
rary  unless  the  investigator  intends  to  apply  the  results  to  a  specific 
real-world  activity.  Vigilance  tasks  generally  require  manual  responses 
although  vocal  responses  are  theoretically  as  acceptable  as  manual  re¬ 
sponses.  Manual  responses  probably  have  been  used  almost  exclusively  to 
date  simply  for  convenience.  Subjects  may  be  required  to  monitor  one 
display  or  several  displays  for  a  signal.  The  vast  majority  of  the  litera¬ 
ture  has  examined  single-source  monitoring,  but  again  the  choice  may  depend 
on  the  desired  applicability  of  the  data. 


Finally,  the  investigator  must  decide  between  successive  versus  simul¬ 
taneous  discrimination  of  signals  and  nonsignals.  Successive  discrimination 
requires  that  a  stimulus  change  repetitively  in  two  ways.  The  most  common 
change  is  usually  defined  to  be  a  nonsignal;  the  less  common  type  of  change 
is  defined  as  a  signal.  For  example,  Williges  (51)  had  an  abstract  geometric 
figure  change  brightness  periodically  from  5  to  4  fL  (17.13  to  13.70 
cd/m^2).  A  1.3-s  period  of  dimness  was  a  signal;  a  1.7-s  period  of  dimness 
was  a  nonsignal.  Simultaneous  discrimination  requires  the  presence  of  both 
the  signal  and  the  nonsignal  either  in  the  same  stimulus  or  at  the  same 
time.  A  good  example  of  simultaneous  discrimination  is  detecting  a  weak 
pure  tone  (the  signal)  against  a  background  of  white  noise  (the  nonsignal). 
The  major  difference  between  simultaneous  and  successive  discrimination  from 
an  information  processing  standpoint  is  that  successive  discrimination  im¬ 
poses  a  short-term  memory  load  on  the  subject  that  is  not  required  by  simul¬ 
taneous  discrimination;  the  subject  must  remember  the  characteristics  of  the 
signal  to  compare  it  with  a  nonsignal  in  the  successive  discrimination 
situation. 


The  choice  between  these  two  types  of  discrimination  again  appears  to  be 
dictated  by  the  applicability  of  the  results.  If  the  investigator  wants  the 
data  to  be  immediately  applicable  to  a  specific  task,  then  the  experimental 
task  must  use  the  same  type  of  discrimination.  If  simulating  a  real-world 
task  is  not  necessary,  the  choice  of  discrimination  is  arbitrary.  However, 
data  obtained  using  the  simultaneous  discrimination  paradigm  require  more 
time-consuming  analyses  than  those  obtained  using  successive  discrimination 
(see  reference  50  for  a  succinct  discussion  of  these  problems). 


Dependent  Measures 


Traditionally,  vigilance  task  performance  is  measured  by  the  probability 
of  detecting  [P(D)]  a  signal.  To  obtain  this  measure,  the  experimental 
session  is  divided  into  a  number  of  equal  time  periods,  and  the  probability 
of  detecting  signals  presented  in  each  period  is  calculated.  In  most  cases 
P(D)  decreases  across  the  time  periods.  This  decrease  is  called  the  "vigi¬ 
lance  decrement."  False  alarms  (FA)  are  also  often  calculated  for  each  time 
period,  and  occasionally  the  average  reaction  time  for  correct  signal 
detections  is  obtained.  Some  investigators  (52)  maintain  that  calculating 
P(D)  and  FA  over  a  period  of  time,  such  as  10  or  15  min,  provides  perform¬ 
ance  measures  that  are  so  crude  as  to  be  misleading.  These  investigators 
advocate  a  more  fine-grained  approach  in  which  the  probability  of  detecting 
each  signal  is  calculated,  and  statements  about  performance  are  based  on 
trends  in  detection  evident  across  signals.  This  approach  was  useful  in 
several  experiments  but  was  never  widely  accepted. 
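
The traditional P(D) calculation can be sketched as follows. The function name, the session length, the period length, and the signal schedule are all hypothetical values chosen to illustrate the bookkeeping; they are not taken from any of the cited studies.

```python
def detection_by_period(signal_times_min, detected, period_min, session_min):
    """Probability of detection, P(D), for each equal time period.

    signal_times_min: onset time of each signal, in minutes from session start.
    detected: True if the corresponding signal was detected.
    Returns None for any period in which no signal was presented.
    """
    n_periods = int(session_min // period_min)
    presented = [0] * n_periods
    hits = [0] * n_periods
    for onset, was_detected in zip(signal_times_min, detected):
        index = min(int(onset // period_min), n_periods - 1)
        presented[index] += 1
        hits[index] += int(was_detected)
    return [h / p if p else None for h, p in zip(hits, presented)]


# Hypothetical 60-min session scored in four 15-min periods.
signal_times = [5, 12, 20, 28, 35, 44, 50, 58]
detections = [True, True, True, False, True, False, False, False]

p_d = detection_by_period(signal_times, detections, period_min=15, session_min=60)
print(p_d)
```

A decline in the returned values across periods corresponds to the vigilance decrement. False alarms per period could be tallied in the same way from the times of responses made when no signal was present.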


Both  the  traditional  approach  based  on  P(D)  and  the  fine-grained  approach 
have  been  replaced  for  the  most  part  by  Signal  Detection  Theory 
(SDT).  McNicol  (53)  gives  a  good  intuitive  explanation  of  this  theory  with 
many  practical  examples.  Green  and  Swets  (54)  provide  a  more  rigorous 
explanation.  The  major  reason  for  adopting  SDT  is  that  this  theory  sepa¬ 
rates  change  in  the  subject's  ability  to  discriminate  a  signal  from  a  non¬ 
signal  from  the  subject's  willingness  to  respond  "signal"  or  "nonsignal." 
Thus,  SDT  is  an  extremely  powerful  theory  that  has  provided  many  insights 
into  vigilance  behavior. 


Estimates  of  the  subject's  ability  to  discriminate  a  signal  from  a 
nonsignal  are  reflected  in  a  dependent  measure  referred  to  as  "d'."  The 
subject's  willingness  to  respond  is  reflected  in  a  measure  referred  to  as 
"beta."  To  calculate  d',  the  P(D)  and  the  number  of  FAs  must  be  calculated 
for  each  time  interval  of  interest.  To  calculate  beta,  the  subject  must  be 
told  the  a  priori  probability  of  signals  and  nonsignals.  Additionally,  the 
subject  should  be  given  a  payoff  matrix  with  specified  rewards  for  each 
correct  detection  of  a  signal  and  each  correct  rejection  of  a  nonsignal  and 
penalties  for  each  missed  signal  and  FA.  Thus,  to  use  SDT,  the  investigator 
must  provide  the  subject  with  more  information  than  is  typically  given  when 
the  traditional  data  analysis  approach  is  followed.  Correct  reaction  times 
may  be  obtained  in  addition  to  d'  and  beta,  but  these  are  secondary  measures. 
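
As an illustration of these measures, the sketch below computes d' and beta under the common equal-variance Gaussian form of SDT, which is an assumption here. It takes the hit rate, P(D), and the false alarm rate (the number of FAs divided by the number of nonsignals presented) as inputs. It also computes the beta an ideal observer would adopt, given the a priori signal probability and the payoff matrix, using the standard expected-value result. The function and parameter names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def d_prime_and_beta(hit_rate, fa_rate):
    """Sensitivity (d') and response criterion (beta) from two proportions.

    Both rates must lie strictly between 0 and 1; rates of exactly 0 or 1
    need a correction before use, which is not handled in this sketch.
    """
    z_hit = _STD_NORMAL.inv_cdf(hit_rate)
    z_fa = _STD_NORMAL.inv_cdf(fa_rate)
    d_prime = z_hit - z_fa
    beta = _STD_NORMAL.pdf(z_hit) / _STD_NORMAL.pdf(z_fa)
    return d_prime, beta


def optimal_beta(p_signal, value_hit, cost_miss, value_cr, cost_fa):
    """Beta that maximizes expected payoff for the given priors and payoffs."""
    prior_odds = (1.0 - p_signal) / p_signal
    payoff_ratio = (value_cr + cost_fa) / (value_hit + cost_miss)
    return prior_odds * payoff_ratio


d_prime, beta = d_prime_and_beta(hit_rate=0.80, fa_rate=0.20)
print(f"d' = {d_prime:.2f}, beta = {beta:.2f}")
print(f"optimal beta = {optimal_beta(0.10, 1.0, 1.0, 1.0, 1.0):.2f}")
```

Comparing the observed beta with the optimal beta is one way to use the prior probabilities and payoff matrix given to the subject. The values would normally be computed separately for each time interval of interest.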

Practical  Problems 


A few practical problems should be considered before including a vigilance
task in a performance battery, but these are neither as numerous nor as
serious as those associated with the rate-of-information-processing tasks.
As noted earlier, some subjects have a great deal of difficulty reaching
training criteria, and a few subjects never reach the criteria. Such
difficulties imply that the investigator must allow a large amount of
training time and must have more than the minimum number of subjects
available.


A more serious problem pertaining to training concerns the ratio of
signals to nonsignals presented during the training session. Typically,
investigators have used signal-to-nonsignal ratios in training that are much
higher than those encountered in the testing sessions. Colquhoun and
Baddeley (55,56) have demonstrated that subjects trained under a
signal-to-nonsignal ratio that is higher than the ratio used in the testing
session show larger vigilance decrements than subjects trained with the same
ratio used in the testing session. Craig and Colquhoun (57) suggest that
much of the observed vigilance decrement is caused by training with
inappropriate signal-to-nonsignal ratios. Craig's (58) analysis of data in
the open literature supports this assertion.

The primary reason for using training ratios that are higher than those
of the testing session is to provide the subject with sufficient practice in
distinguishing signals from nonsignals in as short a time as possible. If
the investigator uses the same ratio in the testing and training sessions,
the length of the training session must be increased to determine if the
subject can detect signals reliably. Thus, investigators have had to choose
between inappropriate training ratios and long training sessions. Recently,
Williams (59) proposed a new training technique based on probability
matching, which uses the appropriate signal-to-nonsignal ratio in a
relatively short training session. Williams demonstrated that this technique
could eliminate some, but not all, of the vigilance decrement.

Another problem concerns the payoff matrix used to determine beta, one
of the SDT measures. Presumably, subjects must be given a payoff matrix to
support the SDT assumptions underlying the calculation of beta. It may not
be possible, however, to pay some subject populations, such as active duty
military personnel. Only one study (60) compared the performance of
subjects receiving a cash payoff matrix with those receiving no payoff
matrix. No difference in performance was found for these two groups;
however, Wiener was not using SDT to analyze his data.

The detectability of the signal as compared to the nonsignals must also
be given serious consideration. If the signal is obvious, the subjects will
detect all of the signals and will make no FAs. Although such data can
still be analyzed using the traditional approach, beta will be
mathematically indeterminate. The possible loss of one of two SDT measures
should be carefully considered when the signals and nonsignals are selected.

Finally, the investigator should be aware of some criticism of laboratory
vigilance tasks. The four major criticisms are that the length of the
testing session is too short, too few sessions are administered, the signal
rate is too high to simulate any real-world activity, and naive subjects are
used. All of these criticisms are justified to some extent. Only one study
(61) used a signal rate typical of many real-world systems (one signal per
week) and examined subject behavior over a 6-month period. The reasons for
using a high signal rate, collecting data during a few short testing
periods, and employing naive subjects are to generate sufficient data for
analysis and to keep costs down. Thus, economic constraints and data
analysis considerations may reduce the applicability of any laboratory task
to real-world behavior despite the best intentions of the investigator.

TRACKING 

Overview 


Tracking tasks can be described using Wickens' Multiple Resources Model.
They require response rather than perceptual or central resources, spatial
rather than verbal processing code resources, manual rather than vocal
response resources (with a few exceptions), and visual rather than auditory
stimulus resources (also with some exceptions). Using the third
classification system, the general consensus is that tracking tasks require
spatial processes and may require spatial short-term memory.

There are two primary reasons for including a tracking task in a
performance battery. The first is to increase the applicability of the data
from the battery. This is a legitimate reason if the battery is designed to
examine the skills and abilities required by activities that involve
tracking. The investigator, however, should consider the relation between
the real-world activity and any potential laboratory tasks carefully before
deciding to add a tracking task to the battery. Bartram et al. (62)
demonstrate that tracking tasks that differ in the display type, number of
dimensions of movement of the cursor, or the allocation of controls to the
limbs correlate poorly. Thus, data obtained from a laboratory tracking task
may not be applicable to a real-world activity if the two differ on any of
the dimensions noted by Bartram et al. There is also reason to suspect that
differences in other parameters, such as control order, will lower
between-task correlations.

The  second  reason  to  include  tracking  tasks  in  a  battery  is  to  obtain 
measures  of  skills  and  abilities  that  are  not  assessed  by  any  other  type  of 
task.  This  implies  that  performance  on  tracking  tasks  should  correlate 
poorly  with  performance  on  other  tasks.  Interestingly,  few  data  show  that 
the  computer-generated  tracking  tasks  used  by  applied  psychologists  do 
correlate  poorly  with  the  skills  and  abilities  required  by  other  tasks. 


Only  one  study  (63)  used  a  computer-generated  tracking  task  to  examine  the 
structure  of  human  abilities.  (This  experiment  was  concerned  primarily  with 
the existence of a general timesharing ability.) This tracking task, performed
singly and in combination with the other tasks of the battery, had
significant  loadings  on  only  one  factor  of  the  solution;  no  other  tasks  had 
significant  loadings  on  this  factor.  These  results  seem  to  indicate  that 
tracking  requires  unique  skills  and  abilities,  but  more  research  is  needed 
before  any  firm  conclusions  can  be  drawn. 

Task  Development 

Five  major  issues  concerning  the  development  of  a  tracking  task  are 
given  below.  Good  tracking  tasks  are  difficult  to  develop  and  almost  impos¬ 
sible  to  debug.  Many  good  tracking  tasks  cannot  be  programmed  on  some 
micro-  and  minicomputers  because  of  their  relatively  slow  processing  speed 
and  limited  memory.  Investigators  with  no  experience  constructing  a  track¬ 
ing  task  should  obtain  the  software  from  a  reliable  source,  if  possible,  and 
consult  with  knowledgeable  individuals  about  the  necessary  equations. 

An investigator must decide to use either a pursuit or a compensatory
display early in the development of the task. Pursuit displays present the
input (command) information on one display element and the system output on
a second display element. Compensatory displays simply present the
difference between the input information and the system output on one
display element. Pursuit displays tend to result in better performance than
compensatory displays for several reasons (9). Chief among these is that the
operator can view the input directly and learn any regularities.
Additionally, the operator can distinguish between the changes caused by the
input and changes caused by control responses.

Most  traditional  laboratory  tasks  use  compensatory  displays  because  it 
is  easier  to  model  an  operator  tracking  a  compensatory  than  a  pursuit  dis¬ 
play.  This  reason  is  usually  irrelevant  to  investigators  constructing 
performance  batteries.  A  second  reason  for  using  compensatory  displays  is 
that  at  one  time  a  compensatory  display  may  have  been  easier  to  construct 
than  a  pursuit  display.  This  consideration  is  also  irrelevant  if  the  track¬ 
ing  tasks  will  be  implemented  on  micro-  or  minicomputers.  Thus,  the  deci¬ 
sion  between  the  two  display  types  may  be  completely  arbitrary. 

If a compensatory display is selected, the investigator must be concerned
with the point at which the forcing function^ is injected into the
tracking task. There are two primary points at which the forcing function
can be introduced. In Figure 1a, the forcing function is introduced after
the system dynamics have transformed the operator's response. In Figure 1b,
the forcing function is first added to the operator's response, and then the
sum is acted upon by the system dynamics. The configuration shown in Figure
1b is not preferred because if higher-order system dynamics are used, they
may effectively filter the forcing function at higher frequencies. As a
result, the operator may experience a forcing function that is considerably
different from the one generated by the computer.
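
The filtering effect can be illustrated numerically. The sketch below assumes the system dynamics are a chain of pure integrators with a fixed gain, which is a simplification chosen for illustration; the function name and the gain value are hypothetical. For such dynamics, a sine-wave component at frequency f is scaled by gain / (2*pi*f)^order, so when the forcing function is added before the dynamics, its higher-frequency components reach the display progressively weaker as the control order increases.

```python
import math


def integrator_gain(order, freq_hz, k=1.0):
    """Amplitude ratio of `order` pure integrators in series (gain k) at freq_hz."""
    omega = 2.0 * math.pi * freq_hz   # angular frequency in rad/s
    return k / (omega ** order)


# Compare how zero-, first-, and second-order dynamics scale three components.
for freq in (0.1, 0.4, 1.0):
    print(freq, [round(integrator_gain(order, freq), 3) for order in (0, 1, 2)])
```

A zero-order system passes every component unchanged, whereas a second-order system reduces a 1-Hz component far more than a 0.1-Hz component, so the displayed disturbance no longer resembles the generated one.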

The investigator must also decide the control system order. The order
of a system refers to the number of time integrations performed on the
control responses. For example, no integrations are performed in a
zero-order (position) system. Thus, moving the control stick to a given
position always results in the cursor moving to a given position on the
display. One time integration is performed in a first-order (rate) system.
Thus, moving the control stick to a given position results in a specific
velocity of the cursor. Two time integrations are performed in a
second-order (acceleration) system. Moving the control stick to a specific
location results in a specific acceleration of the cursor. Although
higher-order systems can be constructed, they are of little interest to an
investigator developing a performance battery.
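
The three control orders can be sketched as a short simulation. This is a minimal illustration that assumes simple Euler integration at a fixed sample period; the function name, the gain, and the 10-ms period are hypothetical choices, not values from the report.

```python
def simulate_cursor(stick, order, dt=0.01, gain=1.0):
    """Return cursor positions for a sequence of stick deflections.

    order 0: position control, order 1: rate control,
    order 2: acceleration control. dt is the sample period in seconds.
    """
    if order not in (0, 1, 2):
        raise ValueError("order must be 0, 1, or 2")
    position, velocity = 0.0, 0.0
    trace = []
    for u in stick:
        if order == 0:
            position = gain * u          # deflection sets cursor position
        elif order == 1:
            position += gain * u * dt    # deflection sets cursor velocity
        else:
            velocity += gain * u * dt    # deflection sets cursor acceleration
            position += velocity * dt
        trace.append(position)
    return trace


step = [1.0] * 100                       # stick held at a fixed deflection for 1 s
for order in (0, 1, 2):
    print(order, round(simulate_cursor(step, order)[-1], 3))
```

With the stick held still, the zero-order cursor stays at one position, the first-order cursor moves at a constant velocity, and the second-order cursor keeps accelerating, which is why the higher orders are harder to control.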

Generally, zero-order control systems result in tasks that are easy
enough to be boring. Consequently, they are of little interest to anyone
developing a tracking task for a normal adult population. Second-order
control systems are too difficult for a normal adult population if the
bandwidth of the forcing function exceeds 0.4 Hz; data obtained using these
types of systems tend to show nonlinearities (see reference 64). Therefore,
an investigator constructing a performance battery can choose between a
first-order system, a second-order system with a limited forcing function,
or a system consisting of a combination of a first- and second-order system.
Such systems may be constructed either by adding the weighted outputs of a
first- and a second-order system or by placing a first-order system in
series with a first-order lag with an adjustable time constant. Although
there is little to aid in selecting among these three options, second-order
systems are used less frequently than first-order or weighted combination
systems in most types of human performance research.

Another major decision concerns the forcing function used as the input
to the system. The investigator must first decide between using
band-limited noise and a function consisting of the sum of sine waves. The
major disadvantage of band-limited noise is that control theory analyses of
the operator's behavior are more difficult to perform. The major advantage
of band-limited noise is that it requires less complicated algorithms to
generate than a sum-of-sine-waves function. Thus, if the investigator must
use computers with limited (less than approximately 256K) random access
memory (RAM), it may be impossible to generate a sum-of-sine-waves forcing
function, leaving band-limited noise as the only alternative.

Constructing a sum-of-sine-waves forcing function requires the
investigator to decide the number and frequency of the sine waves composing
the function. At least three nonharmonically related sine waves must be used
to achieve a random-appearing function. The total number of sine waves used
to construct the function is usually limited by the size of RAM and the
speed of the central processing unit. I use a forcing function consisting of
nine sine waves, a rather common number.

After deciding on the number of sine waves to be used, the investigator
must select the range of the sine waves. The range of usable frequencies is
extremely limited; McRuer and Jex (64) demonstrated that performance is
linearly related to the frequency of the forcing function up to
approximately 1 Hz for zero- and first-order systems and 0.4 Hz for
second-order systems. Beyond these values, performance becomes increasingly
nonlinear. After the investigator has decided on the number of sine waves
and the bandwidth, the specific sine waves are either selected at random or
so that their frequencies are approximately equally spaced when plotted on a
logarithmic scale.

^The term "forcing function" usually refers to a function that is
applied directly to the system dynamics. The terms "input" and "command"
usually refer to a function applied directly to the subject's display.


Occasionally, investigators have included higher-frequency (above 0.4
Hz) sine waves in the forcing function. Such frequencies are included
either when the investigator wants to study human tracking behavior, in
general, or when performance on a specific system with high-frequency inputs
is being examined.

Even  inexperienced  operators  can  learn  a  short  portion  of  a  forcing 
function  if  it  is  repeated  consistently.  The  easiest  way  to  avoid  any 
learning  effect  is  to  choose  a  starting  point  randomly  for  each  sine  wave  at 
the  beginning  of  each  trial.  The  forcing  function  for  a  given  trial  is 
constructed  by  adding  the  sine  waves,  beginning  with  the  randomly  selected 
starting  point  of  each  sine  wave. 
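
The construction described above can be sketched as follows. This is an illustration under stated assumptions rather than the routine used in any particular battery: the band limits (0.02 to 0.4 Hz), the equal amplitudes, and all function names are hypothetical, and the random "starting point" of each sine wave is implemented as a random starting phase. Log spacing does not by itself guarantee nonharmonic frequencies, so the chosen set should be checked.

```python
import math
import random


def log_spaced_frequencies(n_waves, low_hz, high_hz):
    """Frequencies equally spaced on a logarithmic scale (n_waves >= 2)."""
    step = (math.log(high_hz) - math.log(low_hz)) / (n_waves - 1)
    return [math.exp(math.log(low_hz) + i * step) for i in range(n_waves)]


def make_forcing_function(freqs_hz, rng):
    """Return f(t): equal-amplitude sines, each with a random starting phase."""
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in freqs_hz]
    scale = 1.0 / len(freqs_hz)          # keeps |f(t)| within the range -1..1

    def forcing(t):
        return scale * sum(math.sin(2.0 * math.pi * f * t + p)
                           for f, p in zip(freqs_hz, phases))
    return forcing


freqs = log_spaced_frequencies(9, 0.02, 0.4)   # nine waves; band limits assumed
trial_rng = random.Random()                    # seed only if a trial must be replayed
forcing = make_forcing_function(freqs, trial_rng)
samples = [forcing(i * 0.05) for i in range(200)]   # 10 s sampled at 20 Hz
print(len(samples), round(min(samples), 3), round(max(samples), 3))
```

Calling `make_forcing_function` once per trial gives each trial a different waveform with the same frequency content, which is what prevents the operator from learning a repeated segment.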

Finally, the investigator must consider one problem that occurs with
inexperienced operators using a compensatory display. The operator
sometimes moves the control stick in the wrong direction, causing the cursor
to be displaced as far as possible on the display. Frequently, the operator
does not realize the mistake and allows the cursor to remain maximally
displaced for several seconds. The question confronting the investigator
concerns the system response after the operator realizes the mistake and
moves the control in the correct direction. In some systems, the cursor
responds as soon as the control stick is moved in the correct direction. In
other systems, the cursor does not move for some pre-established time after
the control stick is moved correctly, penalizing the operator for not
recognizing the mistake immediately. The problem with the second system is
that an inexperienced operator may become even more confused when the cursor
does not respond immediately to a control movement. The investigator must
decide between penalizing a serious mistake and increasing the possibility
that the operator will become confused and frustrated. Most investigators
feel that the probability of confusing an inexperienced operator is high and
have constructed systems that respond immediately to the correct movement of
the control stick after the cursor has been maximally displaced.

Dependent  Measures 

Tracking tasks have two major classes of dependent measures: classical
error measures, such as RMS and average absolute error, and performance
measures derived from control theory, such as gain and phase lag. The
classical error measures are discussed in detail by Poulton (9). Currently,
the most commonly used error measures are RMS and average absolute error.
Wickens and Gopher (22) describe control theory measures, which give a more
fine-grained analysis of performance.
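
The two classical error measures are simple to compute from the sampled tracking error. The sketch below assumes the error has already been recorded as one value per display update (target position minus cursor position for a pursuit display, or the displayed error itself for a compensatory display); the function names and sample values are hypothetical.

```python
import math


def rms_error(errors):
    """Root-mean-square of the sampled tracking error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mean_absolute_error(errors):
    """Average absolute value of the sampled tracking error."""
    return sum(abs(e) for e in errors) / len(errors)


errors = [0.0, 3.0, -4.0, 0.0]   # hypothetical error samples in display units
print(rms_error(errors), mean_absolute_error(errors))
```

Because RMS error squares each sample, it weights large excursions more heavily than average absolute error does, so the two measures can rank trials differently.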

Practical  Problems 


One of the most serious practical problems, inadequate RAM, has been
noted earlier. Another of the serious problems, the inability to debug the
program, has also been mentioned but warrants further comment. In contrast
to all of the other tasks discussed in this document, a tracking task is
almost impossible to test by performing it. Some parts of the control
dynamics can be tested by using known forcing functions, such as brief
pulses, as input and recording the response of the system dynamics. This
approach is time-consuming and often requires special equipment.
Additionally, it only tests the system dynamics.

Tracking  tasks  require  good  graphics  systems.  Systems  with  poor  resol¬ 
ution  cause  the  cursor  to  hop  around  the  display  instead  of  moving  smoothly. 
The  phosphor  of  the  display  screen  also  may  present  problems;  long-persis¬ 
tence  phosphors  may  blur  the  cursor  as  it  moves.  This  is  very  distracting 
to  the  operator  and  may  induce  visual  fatigue. 

Two other issues need to be discussed briefly. One concerns the system
gain. For most tracking systems, the rule of thumb has been to set the gain
so that the operator can overcome the maximum amplitude of the forcing
function. Usually, the programmer can determine the maximum amplitude, but
the gain should be subsequently tested to ensure that the system dynamics
are not too responsive.

The  second  issue  concerns  the  control  sticks.  The  investigator  must 
first  decide  between  isometric  (no  movement)  or  displacement  (movement) 
control  sticks.  The  type  of  control  stick  interacts  in  a  complex  manner 
with  several  parameters  of  the  tracking  system  to  affect  the  subject's 
performance.  A  description  of  these  interactions  is  given  in  Poulton  (9) 
and Repperger and Levison (65). The investigator may want to consider
these  interactions  if  the  absolute  level  of  the  subject's  performance  is  of 
concern.  The  general  advantages  and  disadvantages  of  each  type  of  control 
stick  are  discussed  by  Frost  (66). 

The investigator will discover quickly that the price of control sticks
varies from less than $100 to more than $2000. Generally, the most
expensive sticks are displacement sticks with a guaranteed linear relation
(usually with less than 1% error) between the angle of displacement and the
voltage output of the stick. Most investigators will not need this degree
of accuracy between the stick displacement and the voltage output unless
they plan to perform a control theory analysis of the data.

Finally,  the  investigator  should  remember  that  displacement  sticks  are 
subject  to  wearing.  As  a  result,  the  null  position  may  become  wider  over 
time  (that  is,  the  angular  displacement  necessary  to  signal  a  response  may 
increase  with  time),  and  the  voltages  resulting  from  the  maximum  angular 
displacement  of  the  stick  may  change.  To  account  for  this  wearing,  the 
investigator  may  have  to  recalibrate  the  system  periodically  or  develop  a 
software  routine  that  recalibrates  the  system  automatically  when  activated. 
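
A recalibration routine of the kind mentioned above might look like the following sketch. It is one possible design, not a description of any existing system: it assumes the stick voltage has already been read by separate acquisition code, estimates the null region from readings taken with the stick at rest, and takes the extremes from a full-range sweep performed by the operator. The class and method names are hypothetical.

```python
class StickCalibration:
    """Maps raw stick voltages to a normalized deflection between -1 and 1."""

    def __init__(self, v_min, v_center, v_max, dead_zone=0.0):
        if not v_min < v_center < v_max:
            raise ValueError("expected v_min < v_center < v_max")
        self.v_min, self.v_center, self.v_max = v_min, v_center, v_max
        self.dead_zone = dead_zone       # half-width of the null region, in volts

    @classmethod
    def from_samples(cls, rest_samples, sweep_samples):
        """Recalibrate from readings at rest and during a full-range sweep."""
        center = sum(rest_samples) / len(rest_samples)
        dead_zone = max(abs(v - center) for v in rest_samples)
        return cls(min(sweep_samples), center, max(sweep_samples), dead_zone)

    def normalize(self, volts):
        """Return 0.0 inside the null region, otherwise a scaled deflection."""
        offset = volts - self.v_center
        if abs(offset) <= self.dead_zone:
            return 0.0
        if offset > 0:
            span = self.v_max - self.v_center
        else:
            span = self.v_center - self.v_min
        value = (abs(offset) - self.dead_zone) / (span - self.dead_zone)
        value = min(1.0, value)          # clamp readings beyond the sweep extremes
        return value if offset > 0 else -value


# Hypothetical readings from a worn displacement stick.
cal = StickCalibration.from_samples([2.48, 2.50, 2.52], [0.5, 2.5, 4.5])
print(cal.normalize(2.51), round(cal.normalize(3.5), 3), cal.normalize(4.6))
```

Running `from_samples` at the start of each session keeps the mapping current as the null region widens and the extreme voltages drift.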

TASK  COMBINATIONS 

Overview 

This section differs from the preceding ones in that it is concerned
with task combinations in general rather than with a specific combination.
This section was included because of an increased interest in multiple-task
performance in exotic environments during the last 10 years. Many
investigators, however, have not recognized the problems associated with
constructing and measuring performance on task combinations. As a result,
many studies have collected data that were either uninterpretable or
unanalyzable.

Task  combinations  are  usually  described  today  using  Wickens'  Multiple 
Resources  Model.  That  is,  the  combinations  are  described  primarily  in  terms 
of  the  number  and  type  of  resources  that  are  shared.  For  example,  a  combin¬ 
ation  might  be  described  by  stating  that  it  required  shared  visual  re¬ 
sources,  separate  code  of  processing  resources,  separate  stage  of  processing 
resources,  and  shared  manual  response  resources.  Combinations  may  also  be 
described  using  the  third  scheme.  That  is,  the  primary  cognitive  structure 
or  processes  required  by  each  task  is  mentioned. 
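
Describing a combination in terms of shared resources amounts to comparing the tasks on each dimension of the model. The sketch below is a hypothetical bookkeeping aid, not part of Wickens' model itself; the class name, field labels, and the profile of the second task are assumptions made for illustration. The tracking profile follows the description given in the Tracking section.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskProfile:
    """Resource demands of one task on the four dimensions named in the text."""
    stage: str      # "perceptual/central" or "response"
    code: str       # "spatial" or "verbal"
    modality: str   # "visual" or "auditory"
    response: str   # "manual" or "vocal"


def shared_resources(task_a, task_b):
    """Names of the dimensions on which two tasks draw on the same resource."""
    return [f.name for f in fields(TaskProfile)
            if getattr(task_a, f.name) == getattr(task_b, f.name)]


tracking = TaskProfile("response", "spatial", "visual", "manual")
# Hypothetical second task: visually presented verbal items, manual response.
memory_search = TaskProfile("perceptual/central", "verbal", "visual", "manual")
print(shared_resources(tracking, memory_search))
```

For this pair the shared dimensions are the stimulus modality and the response, which matches the example combination described above: shared visual and manual resources with separate code and stage resources.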

An investigator has only two reasons for including a task combination in
a performance battery: (1) to measure timesharing skills and abilities, or
(2) to make the data more applicable to a real-world activity that requires
timesharing. Interestingly, almost 100 years of experiments have failed to
isolate conclusively a general timesharing ability. Additionally, little
evidence exists for more specific timesharing abilities (see reference 67
for a good literature review). Thus, at this time, no scientifically
supportable reason exists for including a task combination in a performance
battery to assess the effect of some variable on a timesharing ability.

In contrast, several timesharing skills have been identified (see
references 23 and 68 for examples). Measuring these skills requires
controlled laboratory conditions, practiced subjects, and a significant
amount of statistical analysis. Even under the best conditions, the scores
obtained for these skills are only crude estimates. Therefore, including a
task combination in a performance battery to measure timesharing skills
seems questionable. The only justifiable reason for including a combination
in a battery is to increase the applicability of the data to a specific
real-world activity. In this situation, the investigator should attempt to
simulate the real-world activity as closely as possible to ensure that the
same timesharing skills are required by the laboratory combination.

An  investigator  should  keep  in  mind  that  data  obtained  from  timeshared 
tasks  are  usually  difficult  to  analyze  and  may  require  consultation  with  a 
statistician.  Experimental  designs  requiring  repeated  measurement  of  per¬ 
formance  on  timeshared  tasks  usually  produce  the  most  difficult  type  of  data 
to  analyze;  typically,  this  type  of  data  violates  most  of  the  assumptions  of 
analysis  of  variance.  Some  newer  statistical  techniques,  such  as  correcting 
the  repeated  measures  factors  by  adjusting  the  degrees  of  freedom  (69),  do 
offer  solutions  to  some  of  the  problems  encountered  with  data  from  task 
combinations.  These  techniques  do  not,  however,  offer  solutions  to  all  of 
the  problems  likely  to  be  encountered. 
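
The degrees-of-freedom adjustment can be illustrated with an epsilon-type correction. The report does not name the specific technique cited in reference 69, so the sketch below assumes a Greenhouse-Geisser style estimate computed from the covariance matrix of the repeated measures; all function names are hypothetical. An epsilon of 1.0 indicates that the sphericity assumption holds, and the lower bound is 1/(k-1) for k conditions.

```python
def covariance_matrix(data):
    """Sample covariance of conditions; data[s][c] is subject s, condition c."""
    n, k = len(data), len(data[0])
    means = [sum(row[c] for row in data) / n for c in range(k)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
             for j in range(k)] for i in range(k)]


def greenhouse_geisser_epsilon(cov):
    """Epsilon estimate from a k-by-k covariance matrix of repeated measures."""
    k = len(cov)
    diag_mean = sum(cov[i][i] for i in range(k)) / k
    grand_mean = sum(sum(row) for row in cov) / (k * k)
    row_means = [sum(row) / k for row in cov]
    sum_sq = sum(v * v for row in cov for v in row)
    numerator = (k * (diag_mean - grand_mean)) ** 2
    denominator = (k - 1) * (sum_sq - 2 * k * sum(m * m for m in row_means)
                             + (k * grand_mean) ** 2)
    return numerator / denominator


def corrected_df(epsilon, n_subjects, n_conditions):
    """Numerator and denominator df for the repeated-measures F test."""
    df_effect = epsilon * (n_conditions - 1)
    df_error = epsilon * (n_conditions - 1) * (n_subjects - 1)
    return df_effect, df_error


# Hypothetical scores: 4 subjects tested on 3 sessions.
scores = [[10.0, 12.0, 15.0], [8.0, 9.0, 14.0], [11.0, 11.0, 20.0], [9.0, 13.0, 13.0]]
eps = greenhouse_geisser_epsilon(covariance_matrix(scores))
print(round(eps, 3), corrected_df(eps, n_subjects=4, n_conditions=3))
```

The F ratio itself is unchanged; only the degrees of freedom used to evaluate it are reduced, which makes the test more conservative when the repeated measures have unequal variances or covariances.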

The investigator should also be aware of several other statistical
problems that may be encountered. One of these concerns the type of
analysis to be performed. If two or more dependent measures are obtained from
one of the tasks of a combination, these measures probably will be
significantly correlated. Thus, the data should be analyzed using
multivariate, rather than univariate, statistics. My impression is that the
dependent measures of a given task will be correlated more often when the
task is timeshared than when it is performed singly. The investigator should
be prepared, therefore, to use multivariate statistics.

Another problem concerns the use of derived scores, such as proportion
scores. For example, one way to take single-task performance into account
when analyzing multiple-task performance is to express performance on a task
when it is timeshared as a percentage or proportion of performance on the
same task performed alone. The problem with this approach is that derived
scores often have unusual statistical properties. When derived scores are
analyzed, the results are often misleading. I have conducted several
analyses of timeshared data using the raw data for one analysis and derived
scores for the other and found that the two sets of analyses occasionally
gave contradictory results. The best advice, then, is to analyze raw scores
and to test all the assumptions of the analyses carefully.

One  final  problem  the  investigator  should  keep  in  mind  is  that  a  given 
task  combination  may  reach  differential  stability  very  slowly.  If  differ¬ 
ential  stability  is  necessary,  the  investigator  should  pretest  carefully  to 
establish  the  amount  of  practice  required. 

Task  Development 

If the investigator includes a combination in the performance battery
that simulates a real-world activity, then the development of the
combination is relatively straightforward. Otherwise, the investigator must
develop the combination. The easiest way to develop a combination is to use
Wickens' Model to determine the number of resources to be shared. For
example, the investigator might want to examine the effect of some variable
on performance when all of the timeshared tasks require the same processing
resources. Once the investigator has decided on the number of resources to
be shared, the type of shared resources then must be determined. That is,
if the processing code resources are to be shared, the investigator must
decide to use either spatial or verbal tasks.

After  determining  the  number  and  type  of  shared  resources,  the  investi¬ 
gator  must  decide  how  to  construct  the  tasks  so  that  the  combination  has  the 
desired  characteristics.  For  example,  suppose  the  investigator  wanted  to 
construct  a  combination  with  no  shared  resources.  Should  the  task  requiring 
spatial  code  resources  use  visual  resources  or  auditory  resources?  Should 
this  task  use  manual  or  vocal  resources? 

To construct the tasks, the investigator should consider Wickens'
principle of S-C-R compatibility (see reference 70 for a good, brief summary)
and the desired level of timeshared performance. Basically, the S-C-R
compatibility principle states that the level of single-task performance on
a given task depends on the stimulus/response configuration. More
specifically, optimal performance on a task requiring spatial processing code
resources occurs when the stimuli are presented visually and the subject
responds manually. In contrast, optimal performance on a task requiring
verbal processing code resources occurs when the stimuli are presented
auditorily and the subject responds vocally.

Generally, the S-C-R compatibility principle has a less powerful
influence on multiple-task performance than the number of shared resources.
Nevertheless, this principle can be used to improve good multiple-task
performance and worsen bad multiple-task performance. For instance, assume
that an investigator wanted to construct a combination with no shared
resources. According to Wickens' Model, such a combination should result in
the best timeshared performance. This performance could, however, be
improved by using visual stimuli and manual responses for the spatial task
and auditory stimuli and vocal responses for the verbal task. Decreasing
performance on a combination consisting of tasks that require the same
resources is more problematic; constructing a combination that allows the
subject to respond vocally to both tasks and that presents stimuli
auditorily for both tasks without introducing unique constraints on the
subject's behavior is difficult. Thus, the investigator must be satisfied
with shared visual and manual resources regardless of the processing code
used by the tasks. Nevertheless, the principle is the same.

The investigator should remember that predicting the absolute level of
timeshared performance of a task, even when its single-task performance
levels are known, is not possible. Wickens' Model only predicts that the
relative performance of a task combination deteriorates as the number of
shared resources increases; it says nothing about absolute levels. Thus,
the investigator must obtain pretest data to determine the level of
timeshared performance.

Dependent  Measures 

Performance  under  timeshared  conditions  is  measured  using  the  same 
variables  as  under  single-task  conditions. 

Practical  Problems 

Only  one  software  problem  is  commonly  encountered.  Most  programs  store 
the  subject's  responses  in  a  buffer  until  they  can  be  read  and  recorded.  A 
problem  occurs  under  timeshared  conditions  when  the  subject  responds  simul¬ 
taneously  to  both  tasks.  In  this  situation,  the  program  may  read  only  the 
first  response  and  clear  the  buffer,  deleting  the  second  response.  This 
problem  can  be  circumvented  easily  by  checking  for  several  responses  in  the 
buffer.
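
The difference between the faulty and the corrected buffer handling can be
sketched as follows. This Python example is a minimal illustration that
assumes the response buffer behaves like a queue; the function names and the
response records (task name, key, time in milliseconds) are hypothetical and
do not come from any particular testing program.

```python
from collections import deque

def read_first_and_clear(buffer):
    """Faulty pattern: keep the first response and discard the rest."""
    first = buffer.popleft() if buffer else None
    buffer.clear()  # any simultaneous second response is lost here
    return [first] if first is not None else []

def read_all(buffer):
    """Safer pattern: drain every pending response before returning."""
    responses = []
    while buffer:
        responses.append(buffer.popleft())
    return responses

# Hypothetical simultaneous responses to two timeshared tasks.
pending = [("tracking", "left", 1502), ("memory", "yes", 1502)]

lost = read_first_and_clear(deque(pending))
kept = read_all(deque(pending))
print(len(lost), len(kept))
```

Reading until the buffer is empty records both responses, whereas the first
pattern silently drops the response to the second task.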

The  only  common  hardware  problem  an  investigator  may  encounter  can  also 
be easily circumvented. Subjects often become very frustrated under
timeshared conditions. As a result, they may treat the response apparatus
roughly.  The  investigator  should  ensure  that  all  the  equipment  can  with¬ 
stand  rough  handling  without  breaking. 

Two training problems are among the most serious practical problems the
investigator will encounter. The first is that the relation between the
amount of practice a subject receives on each task of the combination and
subsequent performance on the combination is not known. Thus, establishing
an efficient training schedule that results in an acceptable level of
performance under timeshared conditions requires extensive pretesting. Second,
several investigators have noted that about 5% of normal adults do not learn
under multiple-task conditions. That is, these individuals' performance
never improves with practice. The major problem with including this type of
subject in an experiment is that their performance has little effect on the
mean group performance calculated on a trial-by-trial basis but increases
the standard deviation. Usually, the increase is large enough to cause a
violation of the assumption of homogeneity of variance in subsequent
analyses. Presently, these individuals cannot be identified from either their
single-task performance or their early multiple-task performance, and the
investigator must simply decide post hoc to include or exclude their data
from subsequent analyses.
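
One way to see the effect of such a subject on the trial-by-trial statistics
is sketched below in Python. The scores are invented purely for illustration
(they are not data from any study), and the ratio of the largest to the
smallest per-trial variance is used here only as a simple screening
statistic chosen for this sketch; the text does not prescribe a particular
test of homogeneity of variance.

```python
from statistics import mean, stdev, variance

def per_trial_summary(scores):
    """scores: one list of trial scores per subject.

    Returns a (mean, standard deviation) pair for each trial.
    """
    return [(mean(trial), stdev(trial)) for trial in zip(*scores)]

def variance_ratio(scores):
    """Largest per-trial variance divided by the smallest."""
    variances = [variance(trial) for trial in zip(*scores)]
    return max(variances) / min(variances)

# Hypothetical reaction times (ms) over three trials; lower is better.
learners = [
    [900, 700, 500],
    [920, 720, 520],
    [880, 680, 480],
    [910, 710, 510],
]
non_learner = [900, 900, 900]  # never improves with practice

print(per_trial_summary(learners))
print(per_trial_summary(learners + [non_learner]))
print(variance_ratio(learners), variance_ratio(learners + [non_learner]))
```

In this invented example the per-trial variances are equal without the
non-learning subject and become very unequal once that subject is included.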

Probably the worst problems an investigator will encounter involve
controlling the subjects' priorities. Under any timeshared conditions, a
subject will tend to favor one task over the other(s). This bias can be so
extreme that a subject may actually ignore one task completely. The
investigator's problem, therefore, is to control the subject's priorities in some
way that reduces individual differences. At a minimum, the investigator can
give the subject explicit priorities. This technique, of course, has its
limitations; most normal adults understand equal priority instructions but
may not really understand instructions such as "Give 45% of your attention
to Task A and 65% to Task B." The traditional primary-secondary task
designation also rarely works. This method requires the subject to maintain
single-task performance levels on the primary task. The vast majority of
experiments using this technique show that primary task performance
deteriorates in the presence of the secondary task.

The best way to control a subject's task priorities is to use terminal
KR as described in the Knowledge of Results Section (Chapter 3), and, if
necessary, on-line KR. Both of these techniques have shortcomings, as noted
earlier, but they are currently the best methods available.

Finally,  the  investigator  should  consider  the  problem  of  fatigue.  Per¬ 
forming  under  timeshared  conditions  for  any  period  of  time  is  usually  fatig¬ 
uing.  Unless  the  purpose  of  the  research  is  to  examine  the  effect  of 
fatigue  on  timeshared  performance,  the  investigator  must  schedule  periodic 
breaks.  These  breaks,  however,  may  not  be  sufficient  to  ensure  acceptable 
performance,  particularly  if  the  subject  is  also  frustrated  with  his/her 
performance.  The  investigator  may  want  to  consider  some  type  of  incentive 
based  on  an  individual's  performance  to  compensate  for  the  effects  of  fa¬ 
tigue  and  frustration. 
8. CONCLUDING REMARKS

As  stated  in  the  preface,  this  monograph  is  directed  towards  the  con¬ 
struction  of  a  battery  of  human  information  processing  tests  for  use  in 
repeated-measures  testing  situations.  It  attempts  to  provide  some  guide¬ 
lines  for  individuals  who  have  little  knowledge  of  human  information  pro¬ 
cessing  and  the  problems  associated  with  computerized  testing.  It  is  not  an 
exhaustive  discussion  of  all  the  issues  and  tests  an  investigator  must 
consider  in  constructing  a  battery.  Rather,  it  attempts  to  convey  the  type 
of  questions  an  investigator  must  bear  in  mind  while  constructing  the  bat¬ 
tery.  It  also  attempts  to  alert  the  reader  to  some  of  the  common  pitfalls 
associated  with  repeated-measures  testing. 

Some  readers  may  find  this  monograph  too  abstract  for  dealing  with  the 
complicated  and  interrelated  problems  associated  with  the  development  of  a 
battery.  Such  readers  may  find  the  Army  Air  Force  Aviation  Psychology 
Program  Research  reports  valuable.  This  series,  documenting  the  World  War 
II  aircrew  selection  efforts,  provides  a  detailed,  step-by-step  commentary 
on  the  development  of  the  battery.  The  fourth  report  on  apparatus  testing 
(71)  may  be  particularly  useful. 

Finally, the reader should remember that the methodology for
repeated-measures testing and the associated statistical tests are still being
developed. New techniques for analyzing repeated-measures data probably will be
available  in  a  few  years  that  will  be  more  powerful  than  the  techniques 
currently  available.  The  development  of  increasingly  powerful  microcompu¬ 
ters  will  increase  both  the  number  and  the  types  of  available  tests.  Thus, 
investigators  should  be  alert  for  new  developments  in  methodology,  statis¬ 
tical  techniques,  and  hardware  that  could  be  used  in  repeated-measures 
batteries. 
GLOSSARY

Algorithm.  A  rote  or  mechanical  procedure  for  solving  a  problem. 

Assembly  language.  A  low-level  computer  language  one  step  above  the  binary 
machine  language. 

Average correct reaction time. The average of the reaction times when the
subject responded correctly.

Bandwidth.  The  difference  between  the  frequency  limits  of  a  band  containing 
the  useful  frequency  components  of  a  signal. 

Baseline.  A  measure  of  behavior  under  control  conditions  or  before  the 
experiment  begins.  Later  experimental  treatments  are  expected  to 
modify  the  baseline. 

Between-subjects design. A design in which no repeated measures are
obtained  on  the  subjects. 

Bimodal  distribution.  A  distribution  that  has  two  distinct  modes. 

Carry-over  effect.  An  effect  that  occurs  in  repeated  measures  experiments 
when  the  administration  of  one  treatment  level  affects  a  subject's 
performance  on  subsequent  levels. 

Compile. To prepare a machine language program automatically from a
program written in a higher-level programming language, usually generating
more than one machine instruction for each symbolic statement.

Computerized  test.  Any  test  that  is  presented  using  a  computer,  that  is, 
the  stimuli  are  presented  on  the  CRT  of  the  computer  and  the  computer 
does  all  the  response  recording. 

Constant  mapping.  A  procedure  for  presenting  stimuli  in  the  Sternberg  task 
such  that  memory  set  items  are  never  distracters  and  distracters  are 
never  memory  set  items. 

Counterbalancing. The process of arranging a series of experimental
treatments in such a way as to minimize practice effects, fatigue, or
other order effects. A simple form of counterbalancing would be to
administer two experimental conditions in the order ABBA.

Cursor.  A  moving  display  element  that  represents  the  system  output  on  a 
pursuit  display.  On  a  compensatory  display  the  cursor  represents  the 
difference  between  the  system  input  and  the  output. 

Debug.  To  test  for,  locate,  and  remove  mistakes  from  a  program  or 
malfunctions  from  a  system. 

Dependent variable. The variable that is observed and measured in an
experiment. The dependent variable is the response predicted to change
as a result of and in relation to the manipulation of the independent
variable by the investigator.
Discrete task. A task requiring discrete responses, such as key presses.

Experimental  design.  A  specific  plan  for  assigning  subjects  to  experiment¬ 
al  conditions  and  the  statistical  analysis  associated  with  that  plan. 

Feedback.  Information  provided  by  the  various  sense  organs,  particularly 
information  received  when  a  response  is  made. 

Fit.  To  adjust  a  smooth  curve  of  a  specified  type  to  a  given  set  of  points 
in  such  a  way  as  to  minimize  the  sum  of  the  squares  of  the  distances 
measured  parallel  to  the  axis  of  the  ordinates  from  the  given  points  to 
the  curve. 

Gain.  The  increase  in  signal  power  that  is  produced  by  an  amplifier. 

Hz.  Cycles  per  second. 

Independent  variable.  A  variable  under  control  of  the  investigator. 

Knowledge  of  results.  Augmented  feedback. 

Latin  square.  An  experimental  design  in  which  treatments  are  administered 
in  orders  that  are  systematically  varied. 

Linear regression. A regression analysis that assumes that the predictor
variable  is  related  to  the  predicted  variable  along  a  straight  line. 

Mixed  design.  A  design  that  contains  both  between-  and  within-subject 
factors. 

Normal distribution. A bell-shaped probability curve showing the expected
value  of  sampling  a  random  variable.  Also  called  Gaussian  distribution, 
normal  curve,  normal  probability  curve. 

Normative score. A person's score compared with the scores of other
individuals,  such  as  a  percentile  ranking  in  a  particular  group. 

Paced.  Each  stimulus  of  a  test  follows  the  preceding  stimulus  at  a  certain 
interval.


Paradigm. A model, pattern, or design of the functions and interrelations
of  a  process.  In  psychological  research,  a  paradigm  is  an  experimental 
design  or  plan  of  the  various  steps  of  the  experiment,  or  a  model  of  the 
process  or  behavior  under  study. 

Phase lag (lag angle). The negative of the phase difference between a
sinusoidally varying quantity and a reference quantity, which varies
sinusoidally at the same frequency, when the phase difference is
negative.

Power  of  a  statistical  test.  "The  probability  that  the  test  will  yield 
statistically  significant  results,"  (72). 
Root mean square (RMS) error. A common measure of tracking performance
obtained by squaring the uncorrected error values, summing the squares,
dividing by the number of error values, and then taking the square root of
the result.

Sensitivity.  The  extent  to  which  performance  on  a  test  changes  in  response 
to  changes  in  some  variable. 

Sequence  effects.  The  portion  of  the  carry-over  effects  that  depends  on 
the  order  of  specific  treatments. 

Slope.  The  change  in  the  ordinate  of  a  function  divided  by  the  change  in 
the  abscissa. 

Validation.  The  process  of  determining  the  accuracy  of  a  test  in  measuring 
what  it  purports  to  measure. 

Varied  mapping.  A  procedure  for  presenting  stimuli  in  the  Sternberg  task 
such  that  the  memory  set  items  and  the  distracters  are  randomly  inter¬ 
mixed  over  trials. 

Task  analysis.  The  detailed  breakdown  of  a  job  into  its  component  skills, 
required  knowledge,  and  specific  operations. 

Within-subject design. A design in which repeated measures are obtained on
all  of  the  independent  (experimental)  variables. 

Z-score (standardized score). A score showing the relative status of a
score  in  a  distribution.  The  mean  of  a  distribution  of  standardized 
scores  is  always  0.0,  and  the  standard  deviation  is  always  1.0. 
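
Two of the computational definitions in this glossary, root mean square
error and the z-score, can be sketched in a few lines of Python. The function
names are illustrative. The z-score function assumes the population standard
deviation is used, which is what makes the standardized scores have a
standard deviation of exactly 1.0; a battery that reports sample standard
deviations would give slightly different values.

```python
import math
from statistics import mean, pstdev

def rms_error(errors):
    """Square each error, average the squares, and take the square root."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def z_scores(scores):
    """Standardize scores to a mean of 0 and a standard deviation of 1."""
    m = mean(scores)
    sd = pstdev(scores)  # population SD (an assumption of this sketch)
    return [(s - m) / sd for s in scores]

# Hypothetical tracking errors: sqrt((9 + 16 + 0 + 0) / 4) = 2.5
print(rms_error([3, -4, 0, 0]))

# Hypothetical test scores with mean 5 and population SD 2.
print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
```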
APPENDIX. DIFFERENTIAL STABILITY

Differential  stability  is  concerned  with  performance  in  a  repeated 
measures  situation.  Its  calculation  allows  the  investigator  to  determine 
when  group  performance  has  "stabilized"  according  to  mathematically  defined 
criteria.  Thus,  the  investigator  does  not  have  to  rely  on  "eyeballing"  the 
data  to  determine  when  a  test  has  been  learned  sufficiently  to  avoid  con¬ 
founding  learning  effects  with  experimental  effects  (see  reference  5  for  an 
excellent  example  of  problems  associated  with  estimating  asymptotic  per¬ 
formance  by  "eyeball"). 

To  be  differentially  stable,  performance  on  a  given  test  must  have  three 
characteristics,  which  are  assessed  statistically  with  a  specified  level  of 
experimental  error.  First,  the  means  of  the  dependent  measure  must  either 
be  constant  or  increase  in  a  slow,  linear  fashion.  Second,  the  variances  on 
each trial must be constant. Third, the rank order of subjects must be
constant  from  trial  to  trial. 

To determine differential stability, the investigator must first
calculate the intertrial correlation matrix. This is a square matrix with 1.00s
on the diagonal. Each cell entry represents the correlation between
performance on the ith trial and performance on the jth trial. For most tests
of interest, the intertrial correlation matrix initially has what is known
as  superdiagonal  form.  This  form  is  characterized  by  decreasing  correla¬ 
tions  across  any  given  row  and  up  any  given  column.  After  some  amount  of 
practice,  the  superdiagonal  form  disappears,  and  the  correlations  become 
constant  (within  some  specified  error  level).  When  the  superdiagonal  form 
disappears,  performance  on  the  test  has  become  differentially  stable. 
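
The intertrial correlation matrix itself is straightforward to compute. The
Python sketch below assumes the data are arranged as one list of trial scores
per subject and uses the ordinary Pearson correlation; the function names and
the scores are hypothetical. It produces only the matrix and does not
implement any of the statistical tests used to decide when the superdiagonal
form has disappeared.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def intertrial_correlations(scores):
    """scores: one list of trial scores per subject.

    Returns a trials-by-trials matrix whose (i, j) entry is the correlation
    between performance on trial i and performance on trial j.
    """
    trials = [list(t) for t in zip(*scores)]
    return [[pearson(ti, tj) for tj in trials] for ti in trials]

# Hypothetical scores for four subjects on three trials.
scores = [
    [10, 14, 15],
    [12, 15, 17],
    [9, 12, 16],
    [14, 18, 18],
]

matrix = intertrial_correlations(scores)
for row in matrix:
    print([round(r, 2) for r in row])
```

Each row of the printed matrix can then be inspected, or passed to a formal
test, to see whether correlations still decrease with distance from the
diagonal.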

The  trial  on  which  the  test  becomes  differentially  stable  can  be  identi¬ 
fied  by  performing  a  number  of  different  statistical  analyses  on  the  inter¬ 
trial  correlation  matrix.  The  most  common  analyses  are  the  early-versus- 
late  analysis  of  variance,  the  Lawley  Chi  Squared  Test,  and  the  Steiger  Test 
(73,74).  Bittner  (75)  describes  the  first  two  of  these  in  some  detail  and 
discusses  the  advantages  and  disadvantages  of  each.  The  Steiger  Test  has 
not  been  widely  used  to  date  because  it  is  difficult  to  program.  Conse¬ 
quently,  little  information  is  available  on  its  characteristics. 

The reader should note that some tests never achieve differential
stability (see reference 76 for an example). Other tests are stable
immediately; their intercorrelation matrices never show superdiagonal form.
Such tests usually assess either a skill that has been practiced
extensively, such as simple mental arithmetic, or a skill that is learned
extremely easily, such as turning a screw a given number of rotations.


Figure  1.  Schematic  representations  of  tracking  tasks. 

Figure 1a shows the preferred implementation with the forcing function
added to the task after the system dynamics have acted on the operator's
response. Figure 1b shows a less desirable implementation with the forcing
function added to the operator's output before the system dynamics have
acted on the operator's output.




